H M

Reputation: 129

How to use word embeddings (e.g., Word2vec, GloVe or BERT) to find the word in a set of N words that is most similar to the others?

I am trying to calculate semantic similarity: the input is a list of words, and the output should be the single word that is most similar to the rest of the list.

E.g.

If I pass in a list of words

words = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']

It should output something like this:

['country']

Upvotes: 5

Views: 9766

Answers (2)

Abhi25t

Reputation: 4693

GloVe Embeddings

To load pre-trained GloVe embeddings, we'll use a package called torchtext. It also contains other useful tools for working with text. The documentation for torchtext's GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove

Begin by loading a set of GloVe embeddings. The first time you run the code below, Python will download a large file (862MB) containing the pre-trained embeddings.

import torch
import torchtext

glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6 billion tokens)
                              dim=50)    # embedding size = 50

Let's look at what the embedding of the word "cat" looks like:

glove['cat']

tensor([ 0.4528, -0.5011, -0.5371, -0.0157, 0.2219, 0.5460, -0.6730, -0.6891, 0.6349, -0.1973, 0.3368, 0.7735, 0.9009, 0.3849, 0.3837, 0.2657, -0.0806, 0.6109, -1.2894, -0.2231, -0.6158, 0.2170, 0.3561, 0.4450, 0.6089, -1.1633, -1.1579, 0.3612, 0.1047, -0.7832, 1.4352, 0.1863, -0.2611, 0.8328, -0.2312, 0.3248, 0.1449, -0.4455, 0.3350, -0.9595, -0.0975, 0.4814, -0.4335, 0.6945, 0.9104, -0.2817, 0.4164, -1.2609, 0.7128, 0.2378])

It is a torch tensor of shape (50,). It is difficult to say what each individual number in this embedding means, if anything. However, we do know that there is structure in this embedding space: distances in this space are meaningful.
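
As a quick sanity check, you can print the shape of a single embedding and of the full embedding matrix (glove.vectors and glove.itos are the torchtext attributes used later in this answer):

print(glove['cat'].shape)    # torch.Size([50]): one 50-dimensional vector per word
print(glove.vectors.shape)   # torch.Size([400000, 50]): one row per word in the 6B vocabulary
print(len(glove.itos))       # 400000: index-to-string list for mapping rows back to words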

Measuring Distance

To explore the structure of the embedding space, it is necessary to introduce a notion of distance. You are probably already familiar with the Euclidean distance. The Euclidean distance between two vectors x = [x1, x2, ..., xn] and y = [y1, y2, ..., yn] is just the 2-norm of their difference x − y.

The PyTorch function torch.norm computes the 2-norm of a vector for us, so we can compute the Euclidean distance between two vectors like this:

x = glove['cat']
y = glove['dog']
torch.norm(y - x)

tensor(1.8846)

Cosine similarity is an alternative measure of distance. It measures the angle between two vectors, and has the property that it only considers the direction of the vectors, not their magnitudes.

x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., 2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # should be one

tensor([1.])

The cosine similarity is a similarity measure rather than a distance measure: The larger the similarity, the "closer" the word embeddings are to each other.

x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

tensor([0.9218])

Word Similarity

Now that we have a notion of distance in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".

word = 'cat'
other = ['dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, float(dist))

dog 1.8846031427383423

bike 5.048375129699707

kitten 3.5068609714508057

puppy 3.0644655227661133

kite 4.210376262664795

computer 6.030652046203613

neuron 6.228669166564941

In fact, we can look through our entire vocabulary for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word like "cat".

def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    for idx, difference in lst[1:n+1]:                         # skip the closest entry (the word itself) and print the next n
        print(glove.itos[idx], difference)

print_closest_words(glove["cat"], n=10)

dog 1.8846031

rabbit 2.4572797

monkey 2.8102052

cats 2.8972247

rat 2.9455352

beast 2.9878407

monster 3.0022194

pet 3.0396757

snake 3.0617998

puppy 3.0644655

print_closest_words(glove['nurse'])

doctor 3.1274529

dentist 3.1306612

nurses 3.26872

pediatrician 3.3212206

counselor 3.3987114

print_closest_words(glove['computer'])

computers 2.4362664

software 2.926823

technology 3.190351

electronic 3.5067408

computing 3.5999784
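
The same vocabulary-wide lookup also works with cosine similarity instead of Euclidean distance. A minimal sketch (the helper name print_closest_cosine_words is ours; results are sorted in descending order because larger means closer):

def print_closest_cosine_words(vec, n=5):
    # cosine similarity between vec and every row of the embedding matrix
    sims = torch.cosine_similarity(glove.vectors, vec.unsqueeze(0), dim=1)
    lst = sorted(enumerate(sims.numpy()), key=lambda x: x[1], reverse=True)  # most similar first
    for idx, sim in lst[1:n+1]:   # skip the first entry (the word itself)
        print(glove.itos[idx], sim)

print_closest_cosine_words(glove['cat'], n=5)

Because cosine similarity ignores vector magnitudes, the ranking can differ slightly from the Euclidean one above.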

We could also look at which words are closest to the midpoints of two words:

print_closest_words((glove['happy'] + glove['sad']) / 2)

happy 1.9199749

feels 2.3604643

sorry 2.4984782

hardly 2.52593

imagine 2.5652788

print_closest_words((glove['lake'] + glove['building']) / 2)

surrounding 3.0698414

nearby 3.1112068

bridge 3.1585503

along 3.1610188

shore 3.1618817

Analogies

One surprising aspect of GloVe vectors is that directions in the embedding space can be meaningful. The structure of the GloVe vectors is such that analogy-like relationships like the following tend to hold:

king−man+woman≈queen

print_closest_words(glove['king'] - glove['man'] + glove['woman'])

queen 2.8391209

prince 3.6610038

elizabeth 3.7152522

daughter 3.8317878

widow 3.8493774

We get reasonable answers like "queen", "prince", and "elizabeth".

We can likewise flip the analogy around:

print_closest_words(glove['queen'] - glove['woman'] + glove['man'])

king 2.8391209

prince 3.2508988

crown 3.4485192

knight 3.5587437

coronation 3.6198905

Or, try different but related analogies along the gender axis:

print_closest_words(glove['king'] - glove['prince'] + glove['princess'])

queen 3.1845968

king 3.9103293

bride 4.285721

lady 4.299571

sister 4.421178

print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])

grandmother 2.323353

aunt 2.3527892

granddaughter 2.3615322

daughter 2.4039288

uncle 2.6026237

print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])

uncle 2.0784423

father 2.0912483

grandson 2.2965577

nephew 2.353551

elder 2.4274695

print_closest_words(glove['old'] - glove['young'] + glove['father'])

father 4.0326614

son 4.4065413

grandfather 4.51851

grandson 4.722089

daughter 4.786716

We can move an embedding towards the direction of "goodness" or "badness":

print_closest_words(glove['programmer'] - glove['bad'] + glove['good'])

versatile 4.381561

creative 4.5690007

entrepreneur 4.6343737

enables 4.7177725

intelligent 4.7349973

print_closest_words(glove['programmer'] - glove['good'] + glove['bad'])

hacker 3.8383653

glitch 4.003873

originator 4.041952

hack 4.047719

serial 4.2250676

Bias in Word Vectors

Machine learning models have an air of "fairness" about them, since models make decisions without human intervention. However, models can and do learn whatever bias is present in the training data!

GloVe vectors seem innocuous enough: they are just representations of words in some embedding space. Even so, we'll show that the structure of the GloVe vectors encodes the everyday biases present in the texts they are trained on.

We'll start with an example analogy:

doctor−man+woman≈??

Let's use GloVe vectors to find the answer to the above analogy:

print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])

nurse 3.1355345

pregnant 3.7805371

child 3.78347

woman 3.8643107

mother 3.922231

The doctor−man+woman≈nurse analogy is very concerning. Just to verify, the same result does not appear if we flip the gender terms:

print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])

man 3.9335632

colleague 3.975502

himself 3.9847782

brother 3.9997008

another 4.029071

We see similar types of gender bias with other professions.

print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])

prodigy 3.6688528

psychotherapist 3.8069527

therapist 3.8087194

introduces 3.9064546

swedish-born 4.1178856

Beyond the first result, none of the other words are even related to programming! In contrast, if we flip the gender terms, we get very different results:

print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])

setup 4.002241

innovator 4.0661883

programmers 4.1729574

hacker 4.2256656

genius 4.3644104

Here are the results for "engineer":

print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])

technician 3.6926973

mechanic 3.9212747

pioneer 4.1543956

pioneering 4.1880875

educator 4.2264576

print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])

builder 4.3523865

mechanic 4.402976

engineers 4.477985

worked 4.5281315

replacing 4.600204
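
Coming back to the word list in the question: the same GloVe vectors can be used to score each word by its average cosine similarity to the rest of the list and pick out the extremes. A minimal sketch (the variable names here are ours):

words = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']
vecs = torch.stack([glove[w] for w in words])            # shape (6, 50)

avg_sims = []
for i, w in enumerate(words):
    sims = torch.cosine_similarity(vecs, vecs[i].unsqueeze(0), dim=1)
    others = torch.cat([sims[:i], sims[i+1:]])           # drop the word's similarity to itself
    avg_sims.append(float(others.mean()))
    print(w, avg_sims[-1])

print("highest average similarity:", words[avg_sims.index(max(avg_sims))])
print("lowest average similarity:", words[avg_sims.index(min(avg_sims))])

Whether you want the maximum or the minimum depends on how you read "most similarity": the word closest to the rest of the list, or the odd one out. The other answer below takes the minimum and reports "country".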

Upvotes: 8

Roohollah Etemadi

Reputation: 1403

First, download the pretrained word2vec model trained on Google News from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.

Then the cosine similarity between word embeddings can be computed as follows:

import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models.keyedvectors import KeyedVectors
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a,b):
    return dot(a, b)/(norm(a)*norm(b))

# load the w2v model
path_pretrained_model = './GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin'  # set to the path of the pretrained model
model = KeyedVectors.load_word2vec_format(path_pretrained_model, binary=True)


wlist = ['portugal', 'spain', 'belgium', 'country', 'netherlands', 'italy']
lenwlist = len(wlist)
avrsim = []
# compute the average cosine similarity between each word in wlist and all the other words
for i in range(lenwlist):
    word = wlist[i]
    totalsim = 0
    wordembed = model[word]
    for j in range(lenwlist):
        if i != j:
            word2embed = model[wlist[j]]
            totalsim += cosine_sim(wordembed, word2embed)
    avrsim.append(totalsim / (lenwlist - 1))  # average similarity between word and the other words in wlist

index_min = avrsim.index(min(avrsim))  # index of the word with the lowest average similarity
print(wlist[index_min])

If by similarity you mean the cosine similarity between word embeddings, then "country" has the lowest average similarity to the other words in the list, i.e., it is the word that stands apart from the rest.
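
As a side note, gensim's KeyedVectors also exposes a built-in helper, doesnt_match, that returns the word that goes least well with the others; you can compare its output against the loop above:

print(model.doesnt_match(wlist))  # should single out the outlier of the list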

Upvotes: 3
