I’ve been involved for a while now in a project to scrape Australian online news providers for news articles, headlines, bylines, timelines and topics to analyse. The project has come to maturity with the Australian federal election. It’s such an amazingly rich dataset and we’ll be able to interrogate it for all kinds of insights, from the broad campaign discourses to specific questions on specific issues such as a lack of federal support for the Australian game industry and how this is reported.
The Python-based scrapers have been running all election and really haven’t had that many issues. They occasionally needed fixing when a news provider changed the structure of their reports, for example, moving a byline from the start to the end of an article.
We recently realised we had one major outage for about a day for a specific news provider (news.com.au). In circumstances like this it’s important to move quickly: online news stories can disappear, so it’s best to snap them up as quickly as you can, just to be safe. There’ll be opportunities to go back later and reconcile our quick snaffling efforts with our future efforts to scrape the missing data.
First Snaffle Trick: Twitter
The first snaffle is to get data from Twitter. There are plenty of tools out there to download all the tweets from a particular account for a given date range into a nice spreadsheet. Often news providers tweet out links to their news stories as they are published, and news.com.au is no exception. Unfortunately, news.com.au also appears to tweet links to content that is not of interest to us (i.e. not news reports). This can be filtered to a large extent in a spreadsheet: we can filter out links that don’t start with news.com.au/national or news.com.au/world and so on.
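If spreadsheets aren’t your thing, the same filtering can be done in a few lines of Python. This is just a sketch: the section prefixes and example links below are my assumptions about the site’s URL structure, not a definitive list.

```python
# Keep only tweet links that point at the news sections we care about.
# These prefixes are assumptions about how the site organises its URLs.
wanted_prefixes = (
    "https://www.news.com.au/national",
    "https://www.news.com.au/world",
)

# Made-up example links standing in for the exported tweet data.
links = [
    "https://www.news.com.au/national/some-story",
    "https://www.news.com.au/entertainment/some-promo",
    "https://www.news.com.au/world/another-story",
]

# str.startswith accepts a tuple, so one test covers all the prefixes.
news_links = [link for link in links if link.startswith(wanted_prefixes)]
print(news_links)
```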
Saving all of the links to a single text file (called file.txt), with each URL on a new line, allows us to easily bulk download them. For example, from the Terminal in macOS we can simply run the following command:
xargs -n 1 curl -O < file.txt
We’ll end up with downloaded files of HTML for each of the pages in the same directory as file.txt. We can write a script later to extract all the nice information from these files (headlines, bylines, etc.). Then another script to reconcile the files with our second set of snaffled data.
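As a taste of that later extraction script, here’s a minimal sketch that pulls the title text out of a page with a regular expression. The HTML string is a made-up stand-in for one of the downloaded files; real articles would need more careful parsing for bylines, dates and the rest.

```python
import re

# A stand-in for the contents of one of the downloaded HTML files.
html = "<html><head><title>Cute cat wins election</title></head><body>...</body></html>"

# Grab whatever sits between the <title> tags.
match = re.search("<title>(.*?)</title>", html)
if match:
    print(match.group(1))
```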
Second Snaffle Trick: Bing
Yep. Bing. I’m just as shocked as you are. But a few of the Google options that allow for a quick hack have been moved behind something of a paywall. So here we are.
We’re going to use Bing searches to work out a list of links to relevant news articles from when our scrapers were not working. In order to do this we need to reverse engineer how Bing works. Here’s a search for cats:
The q parameter appears to hold the contents of the search box. We can test this by changing cats to dogs and reloading the page:
Yep. That worked. Now let’s click on the second page of search results and take a look at the URL:
Hmmm. Don’t know what that FORM=PERE bit is, but that first=9 bit looks interesting. Maybe page 2 starts on the 9th search result (excluding ads)? I can’t be bothered counting. Let’s look at the URL of the third page of results:
Starting to see a pattern here? Fourth page URL:
9, 19, 29, 39, …
First time was a great time, second time was a blast, third time I fell in love. Now I hope it lasts
Okay, so we have worked out the pattern to get one page after another, perhaps using some Python code and a loop. Let’s refine our search to get the right stuff now that we know we can get all the stuff.
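Before refining the search, we can sanity check the pattern with a tiny loop. The formula here is just my guess from the URLs above: pages 2, 3, 4, 5 start at results 9, 19, 29, 39.

```python
# Guessed pattern for the "first" parameter on pages 2 onwards.
firsts = [(page - 1) * 10 - 1 for page in range(2, 6)]
print(firsts)  # → [9, 19, 29, 39]
```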
Bing allows for the searching of a particular website (or part of a website), just like good old Google. If we wanted to search for the exact phrase “cute cat” from news.com.au we would do the search:
"cute cat" site:news.com.au
This gives us the following URL:
Gee whiz. That added a lot of guff, didn’t it? Guff is a word I use when I don’t know what I’m talking about. If I don’t know what it is, it must be stupid, right? So I’m just going to go ahead and delete the guff. I’m going to leave the q=… part because I’m familiar with that from our previous explorations:
Well that worked out well. The URL appears to have given the same search results as the gufful version. Playing around further seems to confirm this.
You might be worried about the %22 and %3A. Don’t be. Hug me instead. The short of it is there are some characters that can’t be in a URL, so they are replaced with %something. So the ” from our search box was replaced with %22 and the : from our search box was replaced by %3A. The space in between cute and cat was replaced by a + symbol. This is URL encoding and you can read all about it at W3Schools where there is a full list of what you can’t have in a URL and what to replace it with.
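Python can do this encoding for us, which is handy when building search URLs in a script. The standard library’s quote_plus applies the same rules the browser did above:

```python
from urllib.parse import quote_plus

# quote_plus URL-encodes a string: quotes become %22, colons become %3A,
# and spaces become + — exactly what we saw in the Bing URL.
query = '"cute cat" site:news.com.au'
encoded = quote_plus(query)
print(encoded)  # → %22cute+cat%22+site%3Anews.com.au
```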
Wiser readers will be wondering how we can avoid only finding news stories about cute cats. This thinking is unwise. We only ever want cute cats.
I will end you.
Okay, okay. Maybe we don’t want only cute cats. I admit it. Taking a look at individual news articles, it is clear that they always have a date of publication in a form like MAY 12, 2019. Thus we can use this as our search:
This is going to return articles that have that date listed as publication, but also a few articles that just happen to mention “MAY 12, 2019”. Perhaps they were reporting on an event that occurred yesterday. Remember that we’re trying to quickly snaffle everything. We don’t really care if we get some false positives. We can filter those out later, perhaps from the metadata in the HTML of the pages.
You know what has now happened? You got the right stuff, baby. You’re the reason why I sing this song.
Looking at the search page results we are now faced with the prospect of having to extract the links from each page. Once we’ve got the list of links we can download using the same method in our Twitter snaffle. Gosh this seems like a hard task. I don’t wanna. I’m very lazy. I don’t want to wade through HTML that Microsoft generated. I’ve seen Microsoft Word generate HTML. I can’t… I just can’t risk it. I just. Damn. I mean damn.
We could instead turn our Bing search results into some structured XML. In particular, into an RSS feed by adding &format=rss to the end of the URL:
Which is some magic that I just googled. Not Binged mind you, googled.
While it might be tempting to write a script to get all pages of results, it’s probably sufficient to just get the first 10 or 20 pages. There are, after all, only so many news stories in a given time period. In Python we’ll need the ability to load URLs and also to search the results using regular expressions:
import re        # regular expressions
import requests  # for loading URLs
Now let’s create a list called urls and populate it with the first few pages of results based on all we’ve been able to find out about Bing URLs. More pages of results can be added later. Let’s just do five:
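A sketch of that list, assuming the query is the date search from above with &format=rss appended. The exact URL shape was reverse engineered, so treat it as an assumption rather than a documented API.

```python
# First page of RSS-formatted results for the date search against news.com.au.
base = ("https://www.bing.com/search"
        "?q=%22MAY+12%2C+2019%22+site%3Anews.com.au&format=rss")

# Page 1 has no "first" parameter; pages 2-5 start at results 9, 19, 29, 39.
urls = [base]
for page in range(2, 6):
    urls.append(base + "&first=" + str((page - 1) * 10 - 1))
```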
Now we can load up the URLs:
for url in urls:
    response = requests.get(url)
    print("Page of results: " + response.url)
    data = response.text
    listOfLinks = re.findall("<link>(.*?)</link>", data)
    for link in listOfLinks:
        if link.startswith("https://www.news.com.au"):
            print(link)
For each of the URLs, the code loads the RSS from the link into the data variable. A regular expression is used to find all of the <link> elements. Some of these link elements will be metadata within the RSS feed (i.e. links to Bing rather than links to news.com.au). So we only output the links that start with “https://www.news.com.au”. Instead of printing the link, we could instead write the link to a file, one link per line. Then we can repeat the bulk download step from the first Twitter snaffle.
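That last variation might look like this: the same filter, but collecting the matching links into file.txt, ready for the xargs and curl step from the first snaffle. The links here are made up to show which ones survive the filter.

```python
# Made-up links standing in for what the RSS regex would find:
# the first is feed metadata pointing at Bing, the second is a real story.
links = [
    "https://www.bing.com/some-feed-metadata-link",
    "https://www.news.com.au/national/story-one",
]

# Write only the news.com.au links, one per line, for bulk downloading.
with open("file.txt", "w") as f:
    for link in links:
        if link.startswith("https://www.news.com.au"):
            f.write(link + "\n")
```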
And that’s it: a quick way to snaffle up everything and then do some more systematic work later. Probably some manual work too, given it was only a short downtime.