Creating 3D Models from National Library Search Results: first attempts

Notre Dame 3D model demonstrated by Blaise Aguera y Arcas

The recent burning of Notre Dame reminded me of a 3D model Blaise Aguera y Arcas demonstrated many years ago. The model was made from images obtained from Flickr on mass (pun intended). Computers have gotten faster since 2007 [citation needed], and we now have the National Library of Australia’s Trove aggregation portal. What if I could use large sets of search results from Trove to construct 3D models of Australian landmarks? It should be much easier to do nowadays! It would be a great way to make assets for games and other nifty things.

To get me kick-started, I went to the Trove section of the GLAM Workbench:
https://glam-workbench.github.io/trove/. This helped me to formulate a plan for my code:

  1. Conduct a search on Trove for photographs of the Sydney Opera House.
  2. Find any links to image thumbnails in the search results.
  3. Save a list of the links so I could bulk download them, then process them into a 3D model.

As such, in Python I needed the following libraries:

import requests
from bs4 import BeautifulSoup
import csv
import time

The requests library allowed me to scrape information from the information super highway! BeautifulSoup allowed me to turn whatever roadkill I scraped into something more digestible. The csv library helped me save it for later. The time library was so I could appropriately label what I obtained with a use-by date. Did I just fail to explain what’s going on for the sake of a consistent metaphor? Why yes I did, thanks for noticing. Don’t worry, it will all get explained as we continue in the code.

First, I needed to set all of the parameters to send to Trove in order to get the search results I was after in a nice format:

api_key = '' # You'll need to register for your own and fill it out here
api_url = 'https://api.trove.nla.gov.au/v2'
api_search_path = '/result'

search_for = 'Sydney Opera House'
search_in = 'picture'

params = {
    'q': search_for,
    'zone': search_in,
    'key': api_key,
    'encoding': 'xml',
    'l-availability': 'y/f',
    'l-format': 'Photograph'
}


Let’s break that down. The api_key is needed in order to get information from Trove. Think of it as your library barcode/number for online stuff. You can read all about it at the GLAM Workbench Trove section and just like a physical library card you’ll need to register for one. No, you can’t just use my library card, I’ve already borrowed too many books!

The api_url and api_search_path are where we can get information from the National Library online. If you’ve got a keen eye, you may have noticed the v2 at the end of the api_url. This (presumably) stands for version 2. At some point there’ll be a new version, and a version after that, at which point we could go and change the code. This reflects the mentality of all of the code in this segment: they’re all kind of configuration variables. They can be changed to get different results, and everything that can be changed has been lumped together at the start of the code in order to make it easy to find and configure differently. It would be a pain to have to search through the code to find all these different parts.

The search_for and search_in can also be changed to get different results. Again, they are configuration variables. Don’t want to search for ‘Sydney Opera House’? Change search_for to ‘Melbourne Opera House’. If you’ve seen Trove API code before, you might know that search_in is usually set to ‘newspaper’ – a reflection of the wealth of digitised newspapers available. But we’re interested in the ‘picture’ zone of Trove because newspapers tend to have text and pictures which can be hard to separate automatically. I’ve been warned against such dangerous zones in lore and legend.

Finally, the params (short for parameters) part of the code puts some of the configuration variables in an appropriate format to be sent to Trove. Think of this as filling in a physical form. Perhaps a freedom of information request. It must be in the right format or the request will be ignored. But you shouldn’t get hung up on the bureaucratic requirement or you’ll lose your shit. Just go with it. You might have noticed the l-availability parameter. This is something I reverse engineered from the regular Trove searches. To explain how, here’s the URL for a search for Sydney Opera House pictures:

https://trove.nla.gov.au/picture/result?q=Sydney+Opera+House

Notice how q is equal to Sydney+Opera+House? Load the URL and look at the webpage. Now change Sydney to Melbourne and reload the page:
https://trove.nla.gov.au/picture/result?q=Melbourne+Opera+House. Notice how the Trove search box now contains “Melbourne Opera House” rather than “Sydney Opera House”? We changed the q and the search changed. Hmmm… q must stand for Query! But hang on, there was a q parameter in that bureaucratic form we just filled out. My intuition told me there might be other parameters out there that I could use. Looking at the regular Trove search pages was how to find them.

On the left hand side of Trove searches there are various options to “Refine your results”. For example, you can refine your search to look for only particular decades. You can also look for only “freely available” content. Clicking on this takes us to a new URL with refined results:

https://trove.nla.gov.au/picture/result?q=Sydney+Opera+House&l-availability=y%2Ff

Our q is still the same. But an l-availability=y%2Ff has been added to the end of the URL. Bingo! That’s the parameter that I added to our code to only return freely available search results. Looking through the search results I soon found pictures that were not exactly a kind of photograph:

open day Sydney Opera House  (40)

Perhaps it’s more of a poster? The “Refine your results” has a Photograph option which changes the URL again:

https://trove.nla.gov.au/picture/result?l-availability=y%2Ff&q=Sydney+Opera+House&l-format=Photograph

So l-format could be used in the code to keep out images that would not really help in the production of the 3D model, such as the above image and even images in the form of paintings and cartoons. Further refinement could be on the table for later.
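The detective work above can also be done in code rather than by eye. Here is a minimal sketch, using only Python’s standard urllib.parse module, that pulls the parameters out of the refined search URL and then encodes them back into a query string. The variable names (search_url, found, refinements) are mine, purely for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The refined search URL copied from the browser's address bar.
search_url = (
    "https://trove.nla.gov.au/picture/result"
    "?l-availability=y%2Ff&q=Sydney+Opera+House&l-format=Photograph"
)

# parse_qs decodes "+" to a space and "%2F" to "/".
# It returns a list of values for each parameter name.
found = parse_qs(urlsplit(search_url).query)
for name, values in found.items():
    print(name, "=", values[0])

# Going the other way: a plain dictionary becomes an encoded query string,
# which is roughly what happens to a params dictionary before it is sent.
refinements = {name: values[0] for name, values in found.items()}
query_string = urlencode(refinements)
print(query_string)
```

The y%2Ff in the URL turns out to be y/f with the slash percent-encoded, which is why the params dictionary can hold the plain 'y/f' and leave the encoding to the library.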

Requesting the data

With the configuration configured and our parameters refined, it was time to load up the data for our first page of results:

response = requests.get(api_url+api_search_path, params=params)
print(response.url)
data = response.text

The first line puts together a URL with all the parameters and sends them off to Trove. This is just like loading a webpage. In fact, the next line prints out the URL which was requested:

https://api.trove.nla.gov.au/v2/result?q=Sydney+Opera+House&zone=picture&key=REDACTED&encoding=xml&l-availability=y%2Ff&l-format=Photograph

You can see that I redacted my key (i.e. my library card number – get your own!). You can see our other parameters there too. Loading that URL in a web browser results in a file in the XML format. It looks something like this roadkill and has the same contents that you’ll find in the data variable above:

<response><query>Sydney Opera House</query><zone name="picture"><records s="*" n="20" total="7712" next="/result?q=Sydney+Opera+House&encoding=xml&l-availability=y%2Ff&l-format=Photograph&zone=picture&s=AoIIRiOr2StzdTIzMTc0MjA4Mg%3D%3D" nextStart="AoIIRiOr2StzdTIzMTc0MjA4Mg=="><work id="231758577" url="/work/231758577"><troveUrl>https://trove.nla.gov.au/work/231758577</troveUrl><title>Sydney Opera House</title><contributor>David Chung</contributor><issued>2008</issued><type>Photograph</type><holdingsCount>1</holdingsCount>

…Yuck!

Yep, roadkill alright. Time to cook that up into something useful. It’s BeautifulSoup time:

soup = BeautifulSoup(data, features="xml")

# Print the text of every <identifier> element whose linktype is "thumbnail"
for thumbnail in soup.find_all('identifier', {"linktype": "thumbnail"}):
    print(thumbnail.get_text(" ", strip=True))

The soup is taking our data and heating it up. I’ve set it up so that a particular nice part (the links to the thumbnails) bubbles up to the surface and is displayed for all to see. These are the identifier elements which have linktype attributes of thumbnail:

<identifier type="url" linktype="thumbnail">

The output of our code at this point is a list of links for the first page of results from Trove. Good progress!
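If you want to poke at the extraction step without an API key, the same idea can be tried offline. The sketch below swaps BeautifulSoup for Python’s built-in xml.etree.ElementTree (so it is not the code used in this post) and runs against an invented sample that only mimics the shape of the response shown earlier. The titles, the example.org links, the second linktype value and the thumbnail_links function name are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data: it copies the general shape of the Trove response
# shown earlier, but every title and URL here is a placeholder.
sample_xml = """
<response>
  <zone name="picture">
    <records s="*" n="2" total="2">
      <work id="1">
        <title>First made-up photo</title>
        <identifier type="url" linktype="thumbnail">https://example.org/thumb-1.jpg</identifier>
        <identifier type="url" linktype="fulltext">https://example.org/page-1</identifier>
      </work>
      <work id="2">
        <title>Second made-up photo</title>
        <identifier type="url" linktype="thumbnail">https://example.org/thumb-2.jpg</identifier>
      </work>
    </records>
  </zone>
</response>
"""


def thumbnail_links(xml_text):
    """Return the text of every <identifier> whose linktype is 'thumbnail'."""
    root = ET.fromstring(xml_text.strip())
    return [
        element.text.strip()
        for element in root.iter("identifier")
        if element.get("linktype") == "thumbnail" and element.text
    ]


links = thumbnail_links(sample_xml)
for link in links:
    print(link)
```

Only the two thumbnail links bubble up; the other identifier is left at the bottom of the pot.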

Next time…

We still need to save the output of our code. We still need to scrape and heat up the next pages of results from Trove. We can then use the saved output of our code to download all the images because we have the URLs of all of the images! Then we can turn these images into a 3D model!
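As a rough preview of those first two jobs, here is one way they might look. This is a sketch under assumptions, not the finished code: the function names, the CSV column name and the file naming scheme are all hypothetical, and the idea of passing nextStart back as an s parameter is a guess based on the next attribute in the XML above rather than something tested here.

```python
import csv
import time


def timestamped_filename(prefix="thumbnails", when=None):
    """Build a name such as thumbnails-20190421-153000.csv (the use-by date)."""
    if when is None:
        when = time.localtime()
    return f"{prefix}-{time.strftime('%Y%m%d-%H%M%S', when)}.csv"


def save_links(links, path):
    """Write one link per row under a single header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["thumbnail_url"])  # hypothetical column name
        writer.writerows([link] for link in links)


def next_page_params(params, next_start):
    """Copy the params and add s, assuming s=nextStart asks for the next page."""
    return {**params, "s": next_start}


# The nextStart value is the one visible in the XML earlier in this post.
first_page = {"q": "Sydney Opera House", "zone": "picture"}
page_two = next_page_params(first_page, "AoIIRiOr2StzdTIzMTc0MjA4Mg==")
print(page_two["s"])
print(timestamped_filename())
```

Calling save_links(links, timestamped_filename()) would then drop the collected links into a dated CSV file in the current folder, ready for bulk downloading.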