clairekelley

Reputation: 447

How can I find token similarity in Spacy?

I am trying to calculate token similarity in spaCy, i.e. how close word tokens are to one another. I am using spaCy version 2.0.5. Here is a trivial example:

import spacy
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en') 

x = nlp(u'apple')
y = nlp(u'apple')

x.similarity(y)

This returns -81216639937292144.0 but I had expected it to be 1.0.

In addition,

x = nlp(u'apple')
y = nlp(u'apples')
x.similarity(y)

returns 0.0038385278814858344 which seems wrong as well.

How should I compute token similarity so that it works? I would prefer to stay within spaCy (rather than use a separate string distance package), but I would also welcome other suggestions if this just can't be done in spaCy.

Upvotes: 0

Views: 2383

Answers (2)

mux032

Reputation: 65

I faced the same problem with version 2.0.5. You can roll back to version 2.0.2, where comparing 'apples' with 'apples' gives a score like 1.0000000593284066.

To do this, first uninstall the libraries that spaCy 2.0.5 requires:

for dep in $(pip show spacy | grep Requires | sed 's/Requires: //g; s/,//g') ; do pip uninstall -y $dep ; done

Then install version 2.0.2:

pip install spacy=='2.0.2'

Next, validate the installation:

python -m spacy validate

It might ask you to install some library, such as ftfy, which then turns out to be already installed when you try. In that case, uninstall the library first and then reinstall it (this might happen 3-4 times for different libraries). This is necessary because many libraries get updated to their latest versions when spaCy 2.0.5 is installed. Lastly, download the model:

python -m spacy download en_core_web_sm
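To confirm that the rollback took effect, it may help to check which version is actually installed in the environment Python is using. The following is only a sketch: the helper name `installed_version` is made up for illustration, and the fallback to `pkg_resources` assumes setuptools is available on older Python versions.

```python
def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        # Standard library route, available on Python 3.8 and later.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Assumption: on older Pythons, setuptools provides pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# After the steps above, this is expected to report the rolled-back version.
print("spaCy version:", installed_version("spacy"))
```

If this prints `None`, or a version other than the one you installed, pip and Python are probably pointing at different environments.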

Upvotes: 0

Tanu

Reputation: 1563

I tried the same thing using spaCy version 0.100.7, and it works fine for me:

import spacy
from spacy.en import English
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en') 

x = nlp(u'apple')
y = nlp(u'apple')

print (x.similarity(y)) # prints 0.999999947205

x = nlp(u'apple')
y = nlp(u'apples')

print (x.similarity(y)) # prints 0.6678450944

Can you please check your version of spaCy? Also, have you installed only the default en model?
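As a sanity check on the numbers: spaCy's default similarity is a cosine similarity over vectors, so a score should fall between -1 and 1, and identical inputs should come out at about 1.0 (give or take floating-point rounding). A value like -81216639937292144.0 is far outside that range. Here is a rough plain-Python sketch of the calculation; the `cosine` helper and the toy vectors are made up for illustration and are not spaCy internals, and returning 0.0 for a zero vector is just a choice made in this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Sketch-only convention: a zero vector has no direction, so report 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for word vectors (illustrative values only).
apple = [0.2, -0.5, 0.7]
apples = [0.25, -0.4, 0.6]

print(cosine(apple, apple))   # identical vectors: about 1.0
print(cosine(apple, apples))  # similar vectors: somewhere below 1.0
```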

Upvotes: 1
