aiedu

Reputation: 142

Python spaCy beginner: similarity function

In the spaCy tutorial example for Python, apples.similarity(oranges) returns 0.39289959293092641 for me instead of the 0.7857989796519943 shown in the tutorial.

Is there a reason for that? The original tutorial docs: https://spacy.io/docs/ A tutorial showing a different result from the one I get: http://textminingonline.com/getting-started-with-spacy

Thanks

Upvotes: 3

Views: 2386

Answers (3)

Leonid Ganeline

Reputation: 616

This can happen when one of the texts being compared contains an out-of-vocabulary (OOV) word. Note that OOV words differ between spaCy models, because each model has its own vocabulary.

Upvotes: 0

syllogism_

Reputation: 4297

Thanks to Ethan's report on the issue tracker, this is now fixed.

You'll also get the GloVe vectors by default now, so similarities should in general be more accurate.

Upvotes: 2

Ethan

Reputation: 485

That appears to be a bug in spaCy.

Somehow vector_norm is calculated incorrectly: it does not match the norm computed directly from the vector.

import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
# norm as stored by spaCy
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
# norm computed directly from the vector
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0

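similarity then uses vector_norm. Because each norm is too large by a factor of sqrt(2), the denominator is doubled, so the function always returns half of the correct value (0.3929 instead of 0.7858 in the question).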

def similarity(self, other):
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

If you are only ranking similarity scores, for example to find synonyms, this might be OK, because halving every score preserves the order. But if you need the actual cosine similarity, the result is incorrect.
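To illustrate the effect without needing spaCy installed, here is a minimal sketch in plain Python. The three vectors are made up for illustration (they are not real spaCy vectors), and the norm_scale parameter is a hypothetical stand-in for the inflated vector_norm described above; it is not part of spaCy's API.

```python
import math

def dot(a, b):
    # Dot product of two equal-length sequences of floats
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean norm computed directly from the vector
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b, norm_scale=1.0):
    # norm_scale is a hypothetical knob: 1.0 gives the true cosine
    # similarity, sqrt(2) mimics norms that are stored too large
    norm_a = norm(a) * norm_scale
    norm_b = norm(b) * norm_scale
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Made-up 3-d vectors, for illustration only
apples = [0.6, 0.8, 0.0]
oranges = [0.8, 0.6, 0.0]
bananas = [0.0, 0.6, 0.8]

correct = cosine_similarity(apples, oranges)
buggy = cosine_similarity(apples, oranges, norm_scale=math.sqrt(2))
print(correct, buggy, buggy / correct)

# The order of the scores is unchanged by the scaling
correct_far = cosine_similarity(apples, bananas)
buggy_far = cosine_similarity(apples, bananas, norm_scale=math.sqrt(2))
print(correct > correct_far, buggy > buggy_far)
```

The ratio of the two scores comes out at one half (up to floating-point rounding), which matches the factor between the two values in the question, while the relative order of the scores stays the same.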

I submitted the issue here. Hopefully it will get fixed soon.

Upvotes: 9
