Reputation: 616
spaCy's similarity sometimes behaves strangely. If we compare two completely identical texts, we get a score of 1.0, but if the texts are almost identical we can get a score > 1. This behavior could break our code. Why do we get a score > 1.0, and can we predict it?
import spacy

nlp = spacy.load('en_core_web_md')

def calc_score(text_source, text_target):
    # Cosine similarity between the two documents' vectors
    return nlp(text_source).similarity(nlp(text_target))

calc_score('software development', 'Software development')
# 1.0000000155153665
Upvotes: 0
Views: 592
Reputation: 3781
From https://spacy.io/usage/vectors-similarity:
Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).
Just clamp the score to the valid range with np.clip, as per https://stackoverflow.com/a/13232356/447599
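A minimal sketch of that clamping, assuming the score is a cosine similarity and should therefore lie in [-1.0, 1.0]. The helper name clamp_similarity and its low/high parameters are made up for illustration and are not part of spaCy or NumPy:

```python
import numpy as np

def clamp_similarity(score, low=-1.0, high=1.0):
    """Clamp a similarity score into [low, high] (hypothetical helper)."""
    # np.clip returns a NumPy scalar here; convert it to a plain Python float.
    return float(np.clip(score, low, high))

raw = 1.0000000155153665  # the value reported in the question
print(clamp_similarity(raw))  # 1.0
```

In the question's code this would wrap the return value, e.g. clamp_similarity(nlp(text_source).similarity(nlp(text_target))). Scores already inside the range pass through unchanged.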
Upvotes: 1