Reputation: 129
I am trying to calculate semantic similarity: given a list of words as input, I want to output the single word from the list that is most similar to the list as a whole.
E.g.
If I pass in a list of words
words = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']
It should output something like this:
['country']
Upvotes: 5
Views: 9766
Reputation: 4693
GloVe Embeddings
To load pre-trained GloVe embeddings, we'll use a package called torchtext. It contains other useful tools for working with text that we will see later in the course. The documentation for torchtext's GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove
Begin by loading a set of GloVe embeddings. The first time you run the code below, Python will download a large file (862MB) containing the pre-trained embeddings.
import torch
import torchtext
glove = torchtext.vocab.GloVe(name="6B",  # trained on Wikipedia 2014 corpus of 6 billion words
                              dim=50)     # embedding size = 50
Let's look at what the embedding of the word "cat" looks like:
glove['cat']
tensor([ 0.4528, -0.5011, -0.5371, -0.0157, 0.2219, 0.5460, -0.6730, -0.6891, 0.6349, -0.1973, 0.3368, 0.7735, 0.9009, 0.3849, 0.3837, 0.2657, -0.0806, 0.6109, -1.2894, -0.2231, -0.6158, 0.2170, 0.3561, 0.4450, 0.6089, -1.1633, -1.1579, 0.3612, 0.1047, -0.7832, 1.4352, 0.1863, -0.2611, 0.8328, -0.2312, 0.3248, 0.1449, -0.4455, 0.3350, -0.9595, -0.0975, 0.4814, -0.4335, 0.6945, 0.9104, -0.2817, 0.4164, -1.2609, 0.7128, 0.2378])
It is a torch tensor with dimension (50,). It is difficult to determine what each number in this embedding means, if anything. However, we know that there is structure in this embedding space. That is, distances in this embedding space are meaningful.
Measuring Distance
To explore the structure of the embedding space, it is necessary to introduce a notion of distance. You are probably already familiar with the notion of the Euclidean distance. The Euclidean distance of two vectors x=[x1,x2,...xn] and y=[y1,y2,...yn] is just the 2-norm of their difference x−y.
The PyTorch function torch.norm computes the 2-norm of a vector for us, so we can compute the Euclidean distance between two vectors like this:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)
tensor(1.8846)
Cosine Similarity is an alternative measure of distance. The cosine similarity measures the angle between two vectors, and has the property that it only considers the direction of the vectors, not their magnitudes. (We'll use this property next class.)
x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., 2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # should be one
tensor([1.])
The cosine similarity is a similarity measure rather than a distance measure: The larger the similarity, the "closer" the word embeddings are to each other.
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))
tensor([0.9218])
Word Similarity
Now that we have a notion of distance in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".
word = 'cat'
other = ['dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w])  # euclidean distance
    print(w, float(dist))
dog 1.8846031427383423
bike 5.048375129699707
kitten 3.5068609714508057
puppy 3.0644655227661133
kite 4.210376262664795
computer 6.030652046203613
neuron 6.228669166564941
In fact, we can look through our entire vocabulary for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word like "cat".
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)              # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1])  # sort by distance
    for idx, difference in lst[1:n+1]:                          # take the top n, skipping the word itself
        print(glove.itos[idx], difference)
print_closest_words(glove["cat"], n=10)
dog 1.8846031
rabbit 2.4572797
monkey 2.8102052
cats 2.8972247
rat 2.9455352
beast 2.9878407
monster 3.0022194
pet 3.0396757
snake 3.0617998
puppy 3.0644655
print_closest_words(glove['nurse'])
doctor 3.1274529
dentist 3.1306612
nurses 3.26872
pediatrician 3.3212206
counselor 3.3987114
print_closest_words(glove['computer'])
computers 2.4362664
software 2.926823
technology 3.190351
electronic 3.5067408
computing 3.5999784
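Since cosine similarity was introduced above as an alternative to Euclidean distance, the same nearest-word lookup can also be ranked by cosine similarity. Here is a minimal sketch (the helper name print_closest_cosine_words is mine, not part of torchtext), reusing the glove object loaded above:
def print_closest_cosine_words(vec, n=5):
    # cosine similarity between `vec` and every row of glove.vectors
    sims = glove.vectors @ vec / (glove.vectors.norm(dim=1) * vec.norm())
    lst = sorted(enumerate(sims.numpy()), key=lambda x: x[1], reverse=True)  # most similar first
    for idx, sim in lst[1:n+1]:   # skip the query word itself
        print(glove.itos[idx], float(sim))

print_closest_cosine_words(glove['cat'], n=5)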
We could also look at which words are closest to the midpoint of two word vectors:
print_closest_words((glove['happy'] + glove['sad']) / 2)
happy 1.9199749
feels 2.3604643
sorry 2.4984782
hardly 2.52593
imagine 2.5652788
print_closest_words((glove['lake'] + glove['building']) / 2)
surrounding 3.0698414
nearby 3.1112068
bridge 3.1585503
along 3.1610188
shore 3.1618817
Analogies
One surprising aspect of GloVe vectors is that directions in the embedding space can be meaningful. The structure of the GloVe vectors is such that certain analogy-like relationships tend to hold:
king−man+woman≈queen
print_closest_words(glove['king'] - glove['man'] + glove['woman'])
queen 2.8391209
prince 3.6610038
elizabeth 3.7152522
daughter 3.8317878
widow 3.8493774
We get reasonable answers like "queen", "prince", and "elizabeth".
We can likewise flip the analogy around:
print_closest_words(glove['queen'] - glove['woman'] + glove['man'])
king 2.8391209
prince 3.2508988
crown 3.4485192
knight 3.5587437
coronation 3.6198905
Or, try different but related analogies along the gender axis:
print_closest_words(glove['king'] - glove['prince'] + glove['princess'])
queen 3.1845968
king 3.9103293
bride 4.285721
lady 4.299571
sister 4.421178
print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])
grandmother 2.323353
aunt 2.3527892
granddaughter 2.3615322
daughter 2.4039288
uncle 2.6026237
print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])
uncle 2.0784423
father 2.0912483
grandson 2.2965577
nephew 2.353551
elder 2.4274695
print_closest_words(glove['old'] - glove['young'] + glove['father'])
father 4.0326614
son 4.4065413
grandfather 4.51851
grandson 4.722089
daughter 4.786716
We can move an embedding towards the direction of "goodness" or "badness":
print_closest_words(glove['programmer'] - glove['bad'] + glove['good'])
versatile 4.381561
creative 4.5690007
entrepreneur 4.6343737
enables 4.7177725
intelligent 4.7349973
print_closest_words(glove['programmer'] - glove['good'] + glove['bad'])
hacker 3.8383653
glitch 4.003873
originator 4.041952
hack 4.047719
serial 4.2250676
Bias in Word Vectors
Machine learning models have an air of "fairness" about them, since models make decisions without human intervention. However, models can and do learn whatever bias is present in the training data!
GloVe vectors seem innocuous enough: they are just representations of words in some embedding space. Even so, we'll show that the structure of the GloVe vectors encodes the everyday biases present in the texts that they are trained on.
We'll start with an example analogy:
doctor−man+woman≈??
Let's use GloVe vectors to find the answer to the above analogy:
print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])
nurse 3.1355345
pregnant 3.7805371
child 3.78347
woman 3.8643107
mother 3.922231
The doctor−man+woman≈nurse analogy is very concerning. Just to verify, the same result does not appear if we flip the gender terms:
print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])
man 3.9335632
colleague 3.975502
himself 3.9847782
brother 3.9997008
another 4.029071
We see similar types of gender bias with other professions.
print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])
prodigy 3.6688528
psychotherapist 3.8069527
therapist 3.8087194
introduces 3.9064546
swedish-born 4.1178856
Beyond the first result, none of the other words are even related to programming! In contrast, if we flip the gender terms, we get very different results:
print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])
setup 4.002241
innovator 4.0661883
programmers 4.1729574
hacker 4.2256656
genius 4.3644104
Here are the results for "engineer":
print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])
technician 3.6926973
mechanic 3.9212747
pioneer 4.1543956
pioneering 4.1880875
educator 4.2264576
print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])
builder 4.3523865
mechanic 4.402976
engineers 4.477985
worked 4.5281315
replacing 4.600204
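Coming back to the original question, the same GloVe vectors and cosine similarity can be used to score each word in the list against the rest of the list: the maximum of these scores gives the most representative word, while the minimum gives the odd one out. A rough sketch under that assumption (not part of the notebook material above):
words = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']
vecs = torch.stack([glove[w] for w in words])     # shape: (len(words), 50)
unit = vecs / vecs.norm(dim=1, keepdim=True)      # normalize each row
sims = unit @ unit.t()                            # pairwise cosine similarities
avg = (sims.sum(dim=1) - 1.0) / (len(words) - 1)  # drop the self-similarity of 1
for w, s in zip(words, avg):
    print(w, float(s))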
Upvotes: 8
Reputation: 1403
First, the pretrained word2vec model trained on Google News needs to be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.
Then, the cosine similarity between the embedding of words can be computed as follows:
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models.keyedvectors import KeyedVectors
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# load the w2v model
path_pretraind_model = './GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin'  # set to the path of the pretrained model
model = KeyedVectors.load_word2vec_format(path_pretraind_model, binary=True)

wlist = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']
lenwlist = len(wlist)
avrsim = []
# compute the cosine similarity between each word in wlist and the other words in wlist
for i in range(lenwlist):
    word = wlist[i]
    totalsim = 0
    wordembed = model[word]
    for j in range(lenwlist):
        if i != j:
            word2embed = model[wlist[j]]
            totalsim += cosine_sim(wordembed, word2embed)
    avrsim.append(totalsim / (lenwlist - 1))  # average similarity between word and the other words in wlist

index_min = avrsim.index(min(avrsim))  # index of the word with the lowest average similarity
print(wlist[index_min])
If by similarity you mean the cosine similarity between the word embeddings, then "country" actually has the least similarity to the other words in the list.
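If finding the odd one out (the word least similar to the rest) is indeed the goal, gensim's KeyedVectors also provides a built-in doesnt_match method, which picks the word least similar to the mean of the list, so the manual loop above can be replaced with a one-liner. A short sketch, assuming the model and wlist objects defined above:
print(model.doesnt_match(wlist))              # word least similar to the mean of the list
print(model.similarity('spain', 'portugal'))  # gensim also computes pairwise cosine similarity directly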
Upvotes: 3