Nic Scozzaro

Reputation: 7373

In Spacy how can I efficiently compare the similarity of one document to all others?

For my application I need to compare one document against all the others in order to find the documents most similar to it. In Gensim this can be done efficiently with the MatrixSimilarity class.

spaCy's documentation has an example that compares multiple documents, but for many documents the loop is not an efficient implementation:

import spacy
nlp = spacy.load('en_core_web_lg')

doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")

for doc in [doc1, doc2, doc3]:
    labrador = doc[1]
    dog = nlp(u"dog")
    print(labrador.similarity(dog))

If someone could suggest an efficient way to compare one document to all others in spaCy, it would be much appreciated.

I believe it may involve using a pipeline, but I'm not sure how to use those.

I'll note that the example above from the documentation seems to have an issue, so any ideas for how to get around that issue are also welcome.

Upvotes: 4

Views: 1505

Answers (1)

KonstantinosKokos

Reputation: 3473

Depending on your application and the number of sentences to compare, I would suggest building an array that holds all of your sentence vectors, one per row, each normalized to unit length. Multiplying that matrix by its transpose then gives the cosine similarity of every pair of sentences in a single, rather efficient operation.
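A minimal sketch of that idea, using only NumPy. The function names `cosine_similarity_matrix` and `most_similar` are made up for illustration, and the toy vectors stand in for real document vectors; with spaCy, the rows could presumably be built from `doc.vector` for each doc returned by `nlp.pipe(texts)`.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return pairwise cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors as zeros instead of dividing by 0
    unit = vectors / norms   # each row now has unit length
    return unit @ unit.T     # entry (i, j) is the cosine similarity of rows i and j


def most_similar(vectors, query_index, top_n=3):
    """Rank the other rows by similarity to the row at `query_index` (hypothetical helper)."""
    sims = cosine_similarity_matrix(vectors)[query_index]
    ranked = [int(i) for i in np.argsort(-sims) if i != query_index]
    return [(i, float(sims[i])) for i in ranked[:top_n]]


# Toy stand-ins for document vectors (assumption: real ones would come from
# something like np.array([doc.vector for doc in nlp.pipe(texts)])).
vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
    [1.0, 1.0],
])

sims = cosine_similarity_matrix(vectors)
print(most_similar(vectors, query_index=0, top_n=2))
```

If only one document ever needs to be compared against the rest, multiplying the normalized matrix by that single normalized row avoids computing the full square matrix.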

Upvotes: 3
