3D Models from Trove searches

In a previous post I outlined the first steps in harvesting specific images from Trove so that we can create 3D models of public landmarks such as the Sydney Opera House. Let's continue that journey. The goals for our coding were originally:

  1. Conduct a search on Trove for photographs of the Sydney Opera House.
  2. Find any links to image thumbnails in the search results.
  3. Save a list of the links so I could bulk download them.
  4. Process the images into a 3D model.

But what we achieved was:

  1. Conduct a search on Trove for freely available photographs of the Sydney Opera House.
  2. Find any links to image thumbnails in the search results from the first page of results.
  3. Save a … didn’t do that!
  4. Process the images… didn’t do that yet either!

So let’s get to it by first saving the list of links to thumbnail images for the first page of search results. If we can do this for the first page we should be able to do it for all the pages of results. For reference, here’s our code from last time:

import requests
from bs4 import BeautifulSoup
import csv
import time

api_key = '' # You'll need to register for your own and fill it out here
api_url = 'https://api.trove.nla.gov.au/v2'
api_search_path = '/result'

search_for = 'Sydney Opera House'
search_in = 'picture'

params = {
    'q': search_for,
    'zone': search_in,
    'key': api_key,
    'encoding': 'xml',
    'l-availability': 'y/f',  # only freely available items
    'l-format': 'Photograph'  # only photographs
}

# Run the search and parse the XML response
response = requests.get(api_url + api_search_path, params=params)
print(response.url)
data = response.text

soup = BeautifulSoup(data, features="xml")

# Print the URL of each thumbnail in this page of results
for thumbnail in soup.find_all('identifier', {"linktype": "thumbnail"}):
    print(thumbnail.get_text(" ", strip=True))

We need to work out a file name to use to save our URLs. We could use just "file.csv", but including the search terms, date and time will help us stay organised. We already imported time in a bottle, so the first thing that I'd like to do is save every day in the form YYYYMMDD-HHMMSS:

timestr = time.strftime("%Y%m%d-%H%M%S")

We can then use this timestr variable to construct a file name that includes the search_for variable:

save_file_name_and_location = timestr + "_" + search_for + ".csv"
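Running that at, say, 3:04:05 in the afternoon on 1 January 2021 would give us a file name of 20210101-150405_Sydney Opera House.csv. Yes, file names can contain spaces, even if they look a little odd.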

This code feels very much like configuration, so a nice place for it would be with our other configuration variables. (Just in case my obscure allusions above were lost on you: that was Jim Croce's Time in a Bottle.) Logically, we need it after the search_for variable has been set, so the configuration section now looks like this:
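search_for = 'Sydney Opera House'
search_in = 'picture'

# Build a file name from the current date and time plus the search terms
timestr = time.strftime("%Y%m%d-%H%M%S")
save_file_name_and_location = timestr + "_" + search_for + ".csv"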

We've also already imported the csv module, so we can write those nice spreadsheet-equivalent files whenever we so choose. This may seem like a bit of overkill as we're only writing out a list of links to thumbnails… which is pretty much a single-column spreadsheet. But hey, you never know. I'm being experimental and we could easily end up needing to write out multiple columns of information. Let's set up some code to allow us to write the CSV file:

theNewData = open(save_file_name_and_location, 'w', encoding='utf8', newline='')
csv_writer = csv.writer(theNewData, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, dialect="unix")
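The newline='' in the open call is there because the csv module handles its own line endings; without it you can end up with stray blank rows in the file on some platforms.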

Now if we want to write a row to the file with A as the value in the first column, B in the second and C in the third we would simply write:

csv_writer.writerow(['A', 'B', 'C'])
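With csv.QUOTE_ALL and the unix dialect, that row lands in the file as "A","B","C" with every value quoted and a newline on the end.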

That was easy as something. I can’t quite think of what it was easy as. It will come to me. Easy as… nope. It’s gone. Hmmm.

Previously we'd simply printed our thumbnail URLs using the line:

print(thumbnail.get_text(" ", strip=True))

The first parameter of get_text is the separator placed between any bits of text that get joined together, and strip=True removes (strips) any extra whitespace from the start and end of each bit… in case you were wondering. You probably weren't wondering that. I'm not sure how I'm managing to be socially awkward in a blog post. But there you go. Anyways, we can add a line after we print to also write the same text to our file, like so:

for thumbnail in soup.find_all('identifier', {"linktype": "thumbnail"}):
    print(thumbnail.get_text(" ", strip=True))
    csv_writer.writerow([thumbnail.get_text(" ", strip=True)])
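If you'd like to see what get_text is doing with those parameters, here's a toy fragment you can paste into the same script (the identifier content is made up purely for illustration):

snippet = BeautifulSoup("<identifier>  https://example.org/thumb.jpg  </identifier>", features="xml")
print(snippet.identifier.get_text(" ", strip=True))
# prints: https://example.org/thumb.jpg

The surrounding whitespace has been stripped, and if the tag had contained several bits of text they'd have been joined with our single-space separator.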

It’s a good habit to close the file once you’re done with it:

theNewData.close()
print("That's all folks!")

And that’s done. As soon as you see That’s all folks! you should have a new file in the same directory as your Python code that contains the thumbnail links from the first page of results. And the next pages? And the 3D models?

Same bat-time. Same bat-channel…