thehydrogen

Reputation: 93

Similarity in spaCy

I am trying to understand how similarity in spaCy works. I compared Melania Trump's speech with Michelle Obama's speech to see how similar they were.

This is my code.

import spacy
nlp = spacy.load('en_core_web_lg')

# Read both speeches, silently dropping any non-ASCII characters
with open("melania.txt", encoding="ascii", errors="ignore") as f:
    file1 = f.read()
with open("michelle.txt", encoding="ascii", errors="ignore") as f:
    file2 = f.read()

doc1 = nlp(file1)
doc2 = nlp(file2)
print(doc1.similarity(doc2))

I get a similarity score of 0.9951584208511974, which looks very high to me. Is this correct? Am I doing something wrong?

Upvotes: 9

Views: 18561

Answers (2)

teja chebrolu

Reputation: 61

spaCy's similarity for a sentence or a document is the cosine similarity between the averages of the word vectors that make them up. Hence, if two speeches (each spanning many sentences)

  • have a lot of positive words
  • are produced in similar circumstances
  • use commonly used words

then the averaged vectors of the two speeches end up close together, and the similarity score can be high. But if you do the same with single short sentences, it fails semantically.
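To make the averaging concrete, here is a minimal sketch that uses NumPy only. The 3-dimensional vectors in `toy_vectors` are made up for illustration (real spaCy vectors have hundreds of dimensions), and `doc_vector` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_vectors = {
    "we":      np.array([0.9, 0.1, 0.2]),
    "love":    np.array([0.2, 0.9, 0.1]),
    "our":     np.array([0.8, 0.2, 0.3]),
    "country": np.array([0.3, 0.4, 0.9]),
    "nation":  np.array([0.3, 0.5, 0.8]),
    "hope":    np.array([0.1, 0.8, 0.3]),
}

def doc_vector(words):
    """Average the vectors of all words to get one vector per text."""
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

speech_a = ["we", "love", "our", "country"]
speech_b = ["we", "hope", "our", "nation"]

# The texts share common words and use related ones elsewhere,
# so their averaged vectors point in almost the same direction.
print(cosine(doc_vector(speech_a), doc_vector(speech_b)))
```

With these toy numbers the score comes out above 0.99, even though the two word lists are not identical.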

For example, consider the two sentences below:

sentence 1: "This is about airplanes and airlines"

sentence 2: "This is not about airplanes and airlines"

Both sentences give a high similarity score (0.989662) in spaCy despite meaning the opposite. It seems that the vector of "not" is not that different from the vectors of the other words in the sentence, and its vector_norm is also similar.
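The effect of a single extra word on an average can be sketched with made-up numbers. The vectors below are hypothetical and chosen only so that "not" resembles the other function words; they are not taken from any spaCy model, and `mean_vector` and `cosine` are illustrative helper names.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for this illustration.
toy_vectors = {
    "this":      np.array([0.5, 0.5, 0.1]),
    "is":        np.array([0.6, 0.4, 0.1]),
    "about":     np.array([0.4, 0.6, 0.2]),
    "airplanes": np.array([0.1, 0.2, 0.9]),
    "and":       np.array([0.5, 0.5, 0.2]),
    "airlines":  np.array([0.2, 0.1, 0.9]),
    "not":       np.array([0.6, 0.5, 0.0]),
}

def mean_vector(words):
    """Average the toy vectors of the given words."""
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

plain = ["this", "is", "about", "airplanes", "and", "airlines"]
negated = ["this", "is", "not", "about", "airplanes", "and", "airlines"]

# "not" contributes only one seventh of the negated average, so the
# direction of the mean vector barely moves.
print(cosine(mean_vector(plain), mean_vector(negated)))
```

Averaging has no notion of word order or negation scope, so one extra token can only nudge the mean, no matter how much it changes the meaning.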

Upvotes: 6

Srce Cde

Reputation: 1824

By default spaCy calculates cosine similarity. Similarity is determined by comparing word vectors or word embeddings, multi-dimensional meaning representations of a word.

It returns numpy.dot(self.vector, other.vector) / (self_norm * other_norm), which you can verify by hand:

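import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))

# Same computation by hand: cosine of the two document vectors
print(np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)))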

Output:

spaCy : 0.916553147896471
0.9165532
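The formula divides by both norms, so it is undefined when either vector is all zeros (for example, a text with no known words). Below is a minimal standalone sketch of the same cosine computation with a guard for that case. Returning 0.0 there is an assumption of this sketch, and `cosine_similarity` is a hypothetical helper name, not a spaCy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: treat "no vector" as "no similarity".
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_similarity([0.0, 0.0], [1.0, 1.0]))            # zero vector
```

Because only the angle matters, scaling a vector (the first call) does not change the score.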

The vectors come from the .vector attribute of each Doc, which by default is the average of its token vectors. According to the documentation at the time, the word vectors in spaCy's English models were GloVe vectors.

Upvotes: 20
