Human Centred Programming/SQL: It’s all about Pedagogy

No, I’m not talking about Human Centred Design, or at least I don’t think I am. I’m talking instead about the writing of everyday code, in particular code that is going to be used by others. The code itself needs to be usable and maintainable. This is also true of the overall system design, and much has been written about that (for example, consistent and elegant metaphors in class design/structure).

In teaching people to code we do human centred programming all the time. When we teach a student variables, if statements and how to write their own functions we don’t give them examples with complex loops until we’ve actually taught them how to use loops. We don’t give them examples with classes until we’ve taught them classes (unless we are about to teach them a class on classes). We write code with the knowledge of students in mind… and then we sort of stop doing that. Too many coders ignore lifelong learning.

I was recently working with a researcher without a computer science background. They had worked hard to get a strong self-taught grounding in Python coding, but not in relational databases or in aspects of Python not directly relevant to their research, which is fair enough. They were faced with the problem of reconciling many CSV files. Each row of the CSV files contained the columns:

  • ID
  • date
  • text
  • … and some other columns along for the ride

There were duplicate IDs across different CSV files, but the text and other columns would differ at different times/dates. They needed to wrangle the data into CSVs by date in order to have small files to work with and no duplicates. In the case of a duplicate ID, the row with the more recent date was the one to keep (i.e. the most recent version).

A naive young programmer might recognise this as a database problem and insist on a database solution. Simply create a table with the appropriate columns, enforce ID as the primary key and then load the data in reverse chronological order (allowing the PK to enforce data integrity). Alternatively, don’t enforce the primary key, load all the data, and then query the database to only return the most recent record for each ID. This would be an SQL query something like:

SELECT *
FROM table1 AS t
WHERE date = (SELECT MAX(date) FROM table1 WHERE id = t.id)
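
The primary-key variant is just as short. Here’s a minimal sketch using Python’s built-in sqlite3 module, with the table name, column names and the records iterable assumed purely for illustration:

import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS table1 (id TEXT PRIMARY KEY, date TEXT, text TEXT)")

# Insert rows newest-first; INSERT OR IGNORE (SQLite syntax) lets the
# primary key silently drop any ID that has already been stored.
for row in records:  # records: (id, date, text) tuples in reverse chronological order
    conn.execute("INSERT OR IGNORE INTO table1 (id, date, text) VALUES (?, ?, ?)", row)
conn.commit()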

Simple enough. But this wouldn’t be a people-orientated approach. It’s a solution that relies on the non-computer scientist learning a new tool (databases) and, probably more problematic, maintaining that tool into the future. That’s a metric ton of effort. It drags them away from their core competencies. It isn’t mindful of the skills they already have. It’s revolutionary rather than evolutionary with respect to those skills, meaning there’s not much pedagogical logic here (the same could be said of a NoSQL-style solution).

So instead I set about helping them write a solution that used their existing skills and stretched them just a bit. Through their data gathering the researcher was already familiar with CSV reading and writing, but not with appending to an existing file:

# 'a' appends to the file if it already exists; newline='' is recommended
# when the csv module is handling the line endings
output_writer = open(dateName + ".csv", 'a', encoding='utf8', newline='')

This lets us maintain a set of IDs (called seen) that have already been written to the output files, which will work all fine and dandy because we feed in the input files in reverse chronological order. The set acts as a de facto primary key, which we can enforce in the code by simply doing nothing when we encounter a row whose ID has already been seen:

import csv

seen = set()  # IDs already written to the output files
csv_writer = csv.writer(output_writer, delimiter=',')

for input_file in input_files:
    with open(input_file, encoding='utf8') as theData:  # closes the file for us
        csv_reader = csv.reader(theData, delimiter=',')
        for line in csv_reader:
            # line[i] is the i-th column of the row; column 4 holds the ID
            if line[4] in seen:
                pass  # do nothing; we only write rows whose ID is unseen
            else:
                seen.add(line[4])
                csv_writer.writerow(line)
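
The reverse chronological ordering is what makes that do-nothing branch safe, so input_files has to be built newest-first. One way to do it, assuming (hypothetically) that the input file names start with an ISO-style date so they sort correctly as plain strings:

import glob

# Newest first: ISO-style dates sort lexically, so a reverse string sort works
input_files = sorted(glob.glob("input/*.csv"), reverse=True)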

Due to the large volume of files (both size and quantity) there was a need for some extra code to read the existing output into the seen set before loading new input files. There was also a bunch of extra code to handle date formats between the file naming conventions and the conventions used in different CSV files (an artifact of the data gathering process).
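
A minimal sketch of that pre-load step, assuming (hypothetically) that the existing output files sit in the working directory and keep the ID in column 4 like the inputs; it would replace the bare seen = set() initialisation above:

import csv
import glob

seen = set()
for existing_output in glob.glob("*.csv"):  # previously written output files
    with open(existing_output, encoding='utf8') as previous:
        for line in csv.reader(previous, delimiter=','):
            seen.add(line[4])  # column 4 holds the ID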

So there you have it. A little bit more effort on my part and some pedagogical thinking rather than a lot more effort by someone else (“Just go learn databases”).