Reputation: 7373
For my application I'm comparing one document against all other documents, because I want to find the ones most similar to it. In Gensim this can be done efficiently using the MatrixSimilarity class.
spaCy's documentation has an example of comparing multiple documents, but for many documents the loop is not an efficient implementation:
import spacy
nlp = spacy.load('en_core_web_lg')
doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")
for doc in [doc1, doc2, doc3]:
    labrador = doc[1]
    dog = nlp(u"dog")
    print(labrador.similarity(dog))
If someone could suggest an efficient way to compare one document to all others in spaCy, it would be much appreciated.
I believe it may involve using a pipeline, but I'm not sure how to use those.
I'll note that the example above from the documentation seems to have an issue, so any ideas for how to get around that issue are also welcome.
Upvotes: 4
Views: 1505
Reputation: 3473
Depending on your application and the number of sentences to compare, I would suggest building an array that holds all of your sentence vectors, one per row, each normalized to unit length. Multiplying that array by its transpose then gives every pairwise cosine similarity in a single, rather efficient operation.
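A minimal sketch of that idea with NumPy, in case it helps. The small `doc_vectors` array is made-up stand-in data; with spaCy you would presumably build it from each document's `doc.vector` (for example by stacking the vectors of the docs returned by `nlp.pipe(texts)`). The helper names `cosine_similarity_matrix` and `most_similar` are my own, not part of spaCy or Gensim.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return all pairwise cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors
    # (e.g. a text with no known word vectors).
    norms[norms == 0] = 1.0
    unit = vectors / norms
    # Dot products of unit-length rows are cosine similarities.
    return unit @ unit.T


def most_similar(sim, index, top_n=3):
    """Indices of the `top_n` rows most similar to row `index`, excluding itself."""
    order = np.argsort(-sim[index])  # highest similarity first
    return [int(i) for i in order if i != index][:top_n]


# Made-up stand-in for something like:
#   np.vstack([doc.vector for doc in nlp.pipe(texts)])
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
])

similarities = cosine_similarity_matrix(doc_vectors)
print(most_similar(similarities, 0, top_n=2))
```

If you only ever need one document against the rest, you can skip the full matrix and compute `unit @ unit[index]` instead, which avoids storing an n-by-n array for large collections.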
Upvotes: 3