Reputation: 142
In the spaCy tutorial example in Python, the result of apples.similarity(oranges) is
0.39289959293092641
instead of 0.7857989796519943
Is there any reason for that? Original docs of the tutorial: https://spacy.io/docs/
A tutorial with a different answer to the one I get: http://textminingonline.com/getting-started-with-spacy
Thanks
Upvotes: 3
Views: 2386
Reputation: 616
That can happen because one of the texts being compared contains an out-of-vocabulary (OOV) word. Note: OOV words differ between spaCy models, since each model ships with its own vocabulary.
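As a quick sanity check, you can inspect each token for a missing vector before trusting the similarity score. A minimal sketch, assuming the same "en" model as in the question:

import spacy

nlp = spacy.load("en")
doc = nlp(u"apples oranges")
for token in doc:
    # is_oov and has_vector show whether the token has a usable word vector
    print(token.text, token.is_oov, token.has_vector)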
Upvotes: 0
Reputation: 4297
Thanks to Ethan's report on the issue tracker, this is now fixed.
You'll also get the GloVe vectors by default now — so similarities should in general be more accurate.
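For example, with the vectors loaded you should see the expected score again. A minimal sketch, assuming the same "en" model as in the question (the exact value depends on the model version):

import spacy

nlp = spacy.load("en")
apples, oranges = nlp(u"apples oranges")
# with correct vector norms this is the plain cosine similarity
print(apples.similarity(oranges))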
Upvotes: 2
Reputation: 485
That appears to be a bug in spaCy. Somehow vector_norm is incorrectly calculated.
import spacy
import numpy as np

nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]

# the norm reported by spaCy
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)

# the norm computed directly from the vector
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0
Then vector_norm is used in similarity, which therefore returns a value that is half of the correct one.
def similarity(self, other):
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
If you are only ranking similarity scores (e.g. to find synonyms), this might be OK, since every score is scaled by the same factor and the ranking is preserved. But if you need the actual cosine similarity, the returned value is wrong.
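As a workaround until the fix lands, you can compute the cosine similarity yourself from the raw vectors. A minimal sketch, continuing the example above (the words are arbitrary):

import spacy
import numpy as np

nlp = spacy.load("en")
apples = nlp.vocab[u"apples"]
oranges = nlp.vocab[u"oranges"]

def cosine(a, b):
    # plain cosine similarity, independent of spaCy's vector_norm
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(apples.vector, oranges.vector))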
I submitted the issue here. Hopefully it will get fixed soon.
Upvotes: 9