What do words mean? How close are they to each other?

A friend of mine had been using a website to determine the similarity between words. This was informing some research on students learning new words. But the site went down and he doesn’t know when it will come back up. So he came to me.

This seemed like a good time to try using Google’s CoLab. It’s basically Google Docs, but for Python code: you can easily share a Jupyter Notebook (or similar) with other people and have it run in the cloud. Brilliant.

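I decided to use ELMo, which is part of AllenNLP. (The import used below comes from the older, pre-1.0 releases of AllenNLP, so if it fails you may need to pin a version such as 0.9.0.) Installing in CoLab is easy: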

!pip install allennlp

Then it’s just a simple import to get started:

from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder()

And we will bring scipy along for the ride just to simplify some code:

import scipy.spatial.distance  # import the submodule explicitly; a bare "import scipy" may not load it

What’s that? I forgot to tell you how to install scipy? No I didn’t. Google ignores the batteries-not-included mentality of Python and actually includes the batteries that most people will need. Mucking around with installing stuff is a big barrier to researchers getting stuff done. Well done, Google, you’ll be great overlords eventually.

Consider for a moment the word spring. It has several meanings: a spring of water, a coiled spring that bounces, and spring the season. Each meaning drags other words along with it, such as metal (the coil again), or hay fever and flowers (the season). This is a challenge for young students acquiring vocabulary.

We can use ELMo to work out how “close” the different usages are:

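tokens = ["spring"]
# embed_sentence returns an array of shape (3 layers, number of tokens, 1024)
vectors = elmo.embed_sentence(tokens)
vectors2 = elmo.embed_sentence(["I","drank","some","water"])
# compare the top layer ([2]) for "spring" (token 0) and "water" (token 3)
print(scipy.spatial.distance.cosine(vectors[2][0], vectors2[2][3]))

vectors2 = elmo.embed_sentence(["I","made","it","bounce"])
# same comparison, this time against "bounce" (token 3)
print(scipy.spatial.distance.cosine(vectors[2][0], vectors2[2][3]))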

Which gives us the output:

0.725758969783783
0.6950635313987732

So the word bounce, as used in the sentence “I made it bounce”, is closer to the word spring than the word water is in the sentence “I drank some water” (a smaller distance means closer). For context, the word computer in the sentence “I used the computer” gets a distance of 0.8088545650243759. It’s further away, which makes sense because the word computer isn’t readily associated with the word spring. In contrast, the word spring on its own and the word spring in the sentence “I can spring” are very close to each other (0.37859785556793213).
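If you are wondering what that cosine distance number actually is, here is a minimal sketch that computes it by hand with numpy. The function name cosine_distance and the tiny two-dimensional vectors are mine, purely for illustration; the real ELMo vectors have 1024 dimensions. It is one minus the cosine of the angle between the two vectors, so it ignores how long the vectors are and only cares about which way they point.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v), the same quantity scipy's cosine() reports."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors, chosen so the answers are easy to check by eye.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
right_angle = cosine_distance([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # pointing opposite ways

print(same_direction, right_angle, opposite)
```

So the scale runs from about 0 (pointing the same way) through 1 (unrelated directions) up to 2 (opposite directions). All of the distances above sit between 0 and 1.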

This is great for evaluating students learning language because they are generally asked to define a word as well as use it in an example sentence. We can use this kind of code to measure the distance between their usages and the usages that have been taught in an (experimental) classroom setting.
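Here is a rough sketch of what that comparison could look like. Everything in it is an assumption for illustration: the helper rank_taught_usages is hypothetical, and the three-dimensional vectors are made-up stand-ins so the example runs without ELMo. In practice each vector would come from elmo.embed_sentence on the student's sentence and on each taught sentence.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_taught_usages(student_vector, taught_vectors):
    """Sort the taught usages from closest to furthest from the student's usage."""
    distances = {
        label: cosine_distance(student_vector, vector)
        for label, vector in taught_vectors.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])

# Made-up stand-ins for the vector of "spring" in each taught sentence.
taught = {
    "season": np.array([1.0, 0.0, 0.0]),
    "coil": np.array([0.0, 1.0, 0.0]),
    "water source": np.array([0.0, 0.0, 1.0]),
}
# Made-up stand-in for the vector of "spring" in the student's sentence.
student = np.array([0.1, 0.9, 0.2])

ranking = rank_taught_usages(student, taught)
for label, distance in ranking:
    print(f"{label}: {distance:.3f}")
```

The first entry in the ranking is the taught usage the student's sentence sits closest to. Whether that is close enough to count as "they got it" is a judgment for the researcher, not the code.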

But where are these distances coming from? Well that’s hard to say because the papers about this stuff are full of jargon and largely incomprehensible. But imagine you have a billion words in sentences and you turn that into a big ball of wibbly wobbly vectors somehow. Then you can measure distances in this space. Or not. I’m not the boss of you.
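To make the wibbly wobbly ball a little less mysterious, here is a toy sketch of one old and simple way to turn words into vectors: count which other words turn up in the same sentence. To be clear, this is not how ELMo does it (ELMo learns its vectors with a neural network), and the four-sentence corpus and the function name cooccurrence_vectors are made up for illustration. The point is only that the vectors, and so the distances, come straight out of whatever text you feed in.

```python
import numpy as np

# A made-up, four-sentence corpus.
corpus = [
    "the cat drank the milk",
    "the dog drank the water",
    "the cat chased the dog",
    "the car needs new tyres",
]

def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words sharing a sentence with it."""
    vocab = sorted({word for sentence in sentences for word in sentence.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: np.zeros(len(vocab)) for word in vocab}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:  # skip the word's own position
                    vectors[word][index[other]] += 1
    return vectors

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vectors = cooccurrence_vectors(corpus)
cat_to_dog = cosine_distance(vectors["cat"], vectors["dog"])
cat_to_car = cosine_distance(vectors["cat"], vectors["car"])
print(f"cat-dog: {cat_to_dog:.3f}")
print(f"cat-car: {cat_to_car:.3f}")
```

In this tiny corpus cat ends up closer to dog than to car, simply because cat and dog keep the same company. Change the sentences and the distances change with them.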

But which billion words? Who said them? I can’t find out. Certainly not with the amount of reasonable effort I’ve put in. We’ve measured the closeness of words as they are used, but words used by whom? Maybe a bunch of old white American men. How close are the words socialist and communist then? In the USA socialism has been a bad word. You’d get a different distance from a tool created with an Australian corpus of words than from one created with an American corpus. You just sort of hope that the corpus is large enough to have some Australian texts in there. Or what about stuff written by women? Or by young people? Or written in the last 5 years? The list could go on.

This matters because so much software is built using these basic building blocks. Search engines. Recommendation systems. You name it. But they are largely black boxes.