Reputation: 93
I am trying to understand how similarity in spaCy works. I tried using Melania Trump's speech and Michelle Obama's speech to see how similar they were.
This is my code.
import spacy

nlp = spacy.load('en_core_web_lg')
# Read both speeches, silently dropping any non-ASCII characters
with open("melania.txt", encoding="ascii", errors="ignore") as f:
    file1 = f.read()
with open("michelle.txt", encoding="ascii", errors="ignore") as f:
    file2 = f.read()
doc1 = nlp(file1)
doc2 = nlp(file2)
print(doc1.similarity(doc2))
I get the similarity score as 0.9951584208511974. This similarity score looks very high to me. Is this correct? Am I doing something wrong?
Upvotes: 9
Views: 18561
Reputation: 61
spaCy's similarity for a sentence or a document is computed from a single vector per text, and that vector is just the average of all the word vectors that constitute it. Hence, for 2 speeches (each of which consists of many sentences), the similarity between the averaged word vectors of the two speeches might be high. But if you do the same with just single short sentences, the score fails to reflect the semantics.
For example, consider the two sentences below:
sentence 1: "This is about airplanes and airlines"
sentence 2: "This is not about airplanes and airlines"
Both sentences will give a high similarity score (0.989662) in spaCy despite meaning the opposite. It seems that the vector of "not" is not that different from the vectors of the rest of the words in the sentence, and its vector_norm is also similar.
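To see why averaging can hide a single word such as "not", here is a minimal sketch in plain Python. The 3-dimensional vectors below are made up for illustration (they are not real spaCy or GloVe values), and the helper names average_vector and cosine_similarity are hypothetical, not part of spaCy. It only assumes, as described above, that a text's vector is the mean of its word vectors and that the two means are then compared by cosine similarity.

```python
import math

# Hypothetical toy word vectors (NOT real spaCy/GloVe values), for illustration only.
TOY_VECTORS = {
    "this":      [0.9, 0.1, 0.0],
    "is":        [0.8, 0.2, 0.1],
    "about":     [0.7, 0.3, 0.1],
    "airplanes": [0.1, 0.9, 0.3],
    "and":       [0.8, 0.1, 0.1],
    "airlines":  [0.2, 0.9, 0.2],
    "not":       [0.6, 0.1, 0.5],
}


def average_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


sentence1 = "this is about airplanes and airlines".split()
sentence2 = "this is not about airplanes and airlines".split()

similarity = cosine_similarity(average_vector(sentence1), average_vector(sentence2))
print(similarity)  # stays close to 1 although "not" flips the meaning
```

One extra word contributes only one seventh of the second average, so the direction of the averaged vector barely changes. The longer the text, the smaller the influence of any single word.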
Upvotes: 6
Reputation: 1824
By default spaCy calculates cosine similarity. Similarity is determined by comparing word vectors or word embeddings, multi-dimensional meaning representations of a word.
Internally, similarity() returns numpy.dot(self.vector, other.vector) / (self_norm * other_norm). You can verify this by computing the same expression yourself:
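import numpy as np
import spacy

# Any model that ships word vectors should work; en_core_web_lg (from the question) is assumed here
nlp = spacy.load('en_core_web_lg')
text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))
# Manual cosine similarity between the two document vectors
print(np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)))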
Output:
spaCy : 0.916553147896471
0.9165532
So the score is simply the cosine similarity of the vectors exposed by spaCy's .vector attribute. The documentation says that spaCy's models are trained from GloVe's vectors.
Upvotes: 20