In spaCy, how can I efficiently compare the similarity of one document to all others?

Question

For my application, I need to compare one document against all the other documents in order to find the most similar ones. In Gensim this can be done efficiently with the MatrixSimilarity class.

spaCy's documentation has an example that compares multiple documents, but it loops over them one at a time, which is not efficient for a large number of documents:

import spacy
nlp = spacy.load('en_core_web_lg')

doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")

for doc in [doc1, doc2, doc3]:
    labrador = doc[1]
    dog = nlp(u"dog")
    print(labrador.similarity(dog))

If someone could suggest an efficient way to compare one document to all others in spaCy, it would be much appreciated.

I believe it may involve using a pipeline, but I'm not sure how to use those.

I'll note that the example above from the documentation seems to have an issue, so any ideas for how to get around it are also welcome.
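
To make the goal concrete, here is a rough sketch of the kind of vectorized comparison I have in mind, written with NumPy. It assumes each document is represented by its doc.vector (the averaged word vectors of a model that ships with vectors, such as en_core_web_lg) and that cosine similarity is the measure wanted. The names rank_by_cosine and build_matrix are hypothetical helpers made up for this illustration, not part of spaCy.

```python
import numpy as np


def build_matrix(nlp, texts):
    """Stack one vector per text into a (n_docs, dim) matrix.

    Assumption: `nlp` is a loaded spaCy pipeline with word vectors.
    nlp.pipe processes the texts as a stream instead of one call per text.
    """
    return np.stack([doc.vector for doc in nlp.pipe(texts)])


def rank_by_cosine(query_vec, doc_matrix):
    """Cosine similarity of one vector against every row of a matrix.

    Returns (order, sims): row indices sorted from most to least similar,
    and the similarity score for each row in its original position.
    """
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Product of norms for every (row, query) pair, computed in one pass.
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)

    # Rows with a zero vector (e.g. only out-of-vocabulary words) score 0.
    sims = np.zeros(len(doc_matrix))
    nonzero = denom > 0
    sims[nonzero] = doc_matrix[nonzero] @ query_vec / denom[nonzero]

    # Stable sort keeps the original order among tied scores.
    order = np.argsort(-sims, kind="stable")
    return order, sims


# Toy vectors standing in for doc.vector values.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
toy_order, toy_sims = rank_by_cosine([1.0, 0.0], toy_docs)
```

With real documents the idea would be to call build_matrix(nlp, texts) once, then pass any row of the result as query_vec, so the per-document Python loop is replaced by a single matrix-vector product.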


Answers (1)
