Pages of Results from Trove

Previously on this blog…

Chief of police: You’re mad Harrop, nobody’s ever taken search results from the national library to automatically create 3D models.

Harrop: I can damn well try. I did in part one, part two and now I’m going to get multiple pages of results! I get results goddammit!

Chief of police: Did you just try to hyperlink in spoken conversation? That’s not how talking works, Harrop. I’ll have your damn badge for this! Do you think this is a game? Do you?

Harrop: It has game development applications, for sure. Here, take my badge! I live by a code anyway.

For reference, here’s the complete code from last time:

import requests
from bs4 import BeautifulSoup
import csv
import time

api_key = '' # You'll need to register for your own and fill it out here
api_url = 'https://api.trove.nla.gov.au/v2'
api_search_path = '/result'

search_for = 'Sydney Opera House'
search_in = 'picture'

params = {
 'q': search_for,
 'zone': search_in,
 'key': api_key,
 'encoding': 'xml',
 'l-availability': 'y/f',
 'l-format':'Photograph'
}

timestr = time.strftime("%Y%m%d-%H%M%S")
save_file_name_and_location = timestr + "_" + search_for + ".csv"
theNewData = open(save_file_name_and_location, 'w', encoding='utf8')
csv_writer = csv.writer(theNewData, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, dialect="unix")

response = requests.get(api_url+api_search_path, params=params)
print(response.url)
data = response.text

soup = BeautifulSoup(data, features="xml")

# Each thumbnail link is in an <identifier linktype="thumbnail"> element
for thumbnail in soup.find_all('identifier', {"linktype": "thumbnail"}):
  print(thumbnail.get_text(" ", strip=True))
  csv_writer.writerow([thumbnail.get_text(" ", strip=True)])

theNewData.close()
print ("That's all folks!")

If we load up the first page of results, we can find the location of the next page of results at the start of the XML:

<records s="*" n="20" total="1403" next="/result?q=sydney+opera+house&encoding=xml&l-availability=y%2Ff&l-format=Photograph&zone=picture&s=AoIIRlYrOCtzdTIzMTY2MjQwNg%3D%3D" nextStart="AoIIRlYrOCtzdTIzMTY2MjQwNg==">

The contents of the next attribute can be combined with the api_url variable to derive the full URL of the next page, and BeautifulSoup can pull that attribute out for us:

firstNextPage = soup.find('records').get('next')

This can then be used to request the next page:

responseSubsequent = requests.get(api_url+firstNextPage)
print(responseSubsequent.url)

But loading that URL gives an error message:

Forbidden

Invalid API Key

You can get technical details here.
Please continue your visit at our home page.

Well shit. And there I was one day from retirement.

If we take a look at the parameters of the URL we made, we can indeed see that the key parameter is missing. You may recall that the key is kind of like a library card number: needed for "borrowing" data from the national library, and something you want to keep secret lest other people "borrow" too much on your behalf. So it makes sense that Trove leaves it out of the next URL. We stored the key in the variable api_key, so we can easily add it back to our request. There's no need to add any of the other parameters because they're already in the next URL and we'd only end up with duplicates:

firstNextPage = soup.find('records').get('next')

params = {
  'key': api_key
}

responseSubsequent = requests.get(api_url+firstNextPage, params=params)
print(responseSubsequent.url)

We could continue on and write code to extract the links from this second page and to find the page after it, the same way we handled the first page:

mySoup = BeautifulSoup(responseSubsequent.text, features="xml")

for thumbnail in mySoup.find_all('identifier', {"linktype": "thumbnail"}):
  print(thumbnail.get_text(" ", strip=True))
  csv_writer.writerow([thumbnail.get_text(" ", strip=True)])

veryNextPage = mySoup.find('records').get('next')

And we could keep going: copy this code and change it to get the third page, copy and paste and change the variable names to get the fourth page, and so on. Copying and pasting code again and again is a pretty good indication that a function is needed. We can put everything in a function that calls itself with the details of the next page:

def nextPages(theNextPage):

  params = {
    'key': api_key
  }

  responseSubsequent = requests.get(api_url+theNextPage, params=params)
  print(responseSubsequent.url)

  mySoup = BeautifulSoup(responseSubsequent.text, features="xml")

  # Save the thumbnail links from this page
  for thumbnail in mySoup.find_all('identifier', {"linktype": "thumbnail"}):
    print(thumbnail.get_text(" ", strip=True))
    csv_writer.writerow([thumbnail.get_text(" ", strip=True)])

  # Then call ourselves with the location of the page after this one
  veryNextPage = mySoup.find('records').get('next')

  print("***")
  nextPages(veryNextPage)

nextPages(firstNextPage)

The very last line calls the function for the first time with the details from the first page of results. So we still have some repeated code: the function and the code to handle the very first page. Ideally we would write the function to handle all pages and delete the code to handle only the first page. This would make it easy to change the URL parameters later. But this works, so we'll go with it. We're experimenting. We don't want to get bogged down.
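
For what it's worth, a tidier version might fold the first page into the same function, something like the sketch below. It's a sketch only: the function name fetchPage and the way it takes the parameters as an argument are my own invention here, not code from the earlier posts.

# Full search parameters, only needed for the very first request
searchParams = {
  'q': search_for,
  'zone': search_in,
  'key': api_key,
  'encoding': 'xml',
  'l-availability': 'y/f',
  'l-format': 'Photograph'
}

def fetchPage(path, requestParams):
  # Fetch one page of results, save its thumbnail links,
  # then chase the next page if Trove gives us one
  response = requests.get(api_url+path, params=requestParams)
  print(response.url)

  pageSoup = BeautifulSoup(response.text, features="xml")

  for thumbnail in pageSoup.find_all('identifier', {"linktype": "thumbnail"}):
    csv_writer.writerow([thumbnail.get_text(" ", strip=True)])

  nextPage = pageSoup.find('records').get('next')
  if nextPage:
    # Later pages already carry every parameter except the key
    fetchPage(nextPage, {'key': api_key})

# The first page needs the full set of search parameters
fetchPage(api_search_path, searchParams)

Anyway, that's a cleanup for another day.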

I'm also being a terrible private citizen here. I'm just making requests as fast as possible to Trove. This is not very nice to their servers. I should put in some kind of delay between each request so they don't get overwhelmed. A fraction of a second. But at the moment I'm not making that many requests, so we're cool.
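
If I were being politer, one line at the top of the function would do it. Half a second is just my own guess at a reasonable pause, not a number Trove specifies, and the rest of the function stays the same:

def nextPages(theNextPage):

  # Wait a moment before each request so Trove's servers get a breather.
  # 0.5 seconds is a guess at a polite pause, not something Trove asks for.
  time.sleep(0.5)

  params = {
    'key': api_key
  }

  responseSubsequent = requests.get(api_url+theNextPage, params=params)
  print(responseSubsequent.url)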

The final point to note about the code is that it always assumes there is another page. It will crash if there is not. Eventually it will crash anyway because it hits the limits of "borrowing" on my API key (my library card). Again, this is not an issue because I'm only after a few thousand search results. The file with the list of links will still exist if the program crashes halfway through. It will just have as many links as were done up to the point of the crash. This is an advantage of writing to the file one line at a time, like the crimes of this city: one on top of the next on top of the next until you think you're going to choke on the dirt of it all. Crimes that I'm going to clean up no matter what the chief of police says, goddammit.
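
For the record, if I ever wanted the program to finish politely instead of crashing when the results run out, a small check before the recursive call would do it. A sketch only: I haven't checked exactly what the last page of Trove results looks like, but if the next attribute simply isn't there, get() returns None and the check stops us. The last few lines of nextPages would become something like:

  veryNextPage = mySoup.find('records').get('next')

  # Only keep going if Trove actually told us where the next page is
  if veryNextPage:
    print("***")
    nextPages(veryNextPage)
  else:
    print("No more pages.")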

To be continued...