Leonid Ganeline

Reputation: 616

Similarity between two lists of documents

I need to find the similarity between two lists of short texts in Python. Each text is 1-4 words long, and each list can contain about 10K items, so I need to compute 10K × 10K = 100M similarity scores efficiently. I couldn't find an efficient way to do this in spaCy. Maybe other packages can do it? I assume the words are represented by 300-dimensional vectors, but other representations are fine too. This can be done in a loop, but there must be a more efficient way. The task seems like a good fit for TensorFlow, PyTorch, and similar packages, but I'm not familiar with the details of those packages.

Upvotes: 0

Views: 1771

Answers (2)

Leonid Ganeline

Reputation: 616

The solution was to use something like Spotify's Annoy, which implements approximate nearest-neighbour search. Other libraries for nearest-neighbour search exist as well.
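To make the nearest-neighbour idea concrete, here is a minimal sketch that keeps only the top-k most similar items per query instead of storing all 100M scores. Note that this is an exact brute-force search in NumPy, not Annoy's approximate index; the function name `top_k_neighbours`, the `chunk_size` parameter, and the random test vectors are assumptions made for illustration.

```python
import numpy as np

def top_k_neighbours(queries, items, k=5, chunk_size=1000):
    """Return (indices, scores) of the k most cosine-similar item rows per query row."""
    q = np.asarray(queries, dtype=np.float32)
    x = np.asarray(items, dtype=np.float32)
    # Unit-normalise rows so a dot product equals cosine similarity.
    q = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    k = min(k, len(x))
    all_idx, all_scores = [], []
    # Process queries in chunks so the full score matrix never sits in memory.
    for start in range(0, len(q), chunk_size):
        scores = q[start:start + chunk_size] @ x.T  # shape: (chunk, n_items)
        # argpartition selects the k largest scores per row without a full sort.
        idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
        part = np.take_along_axis(scores, idx, axis=1)
        # Sort only the k selected candidates, best first.
        order = np.argsort(-part, axis=1)
        all_idx.append(np.take_along_axis(idx, order, axis=1))
        all_scores.append(np.take_along_axis(part, order, axis=1))
    return np.vstack(all_idx), np.vstack(all_scores)

# Hypothetical data: random vectors stand in for 300-d text vectors.
rng = np.random.default_rng(0)
items = rng.normal(size=(2000, 300))
# Each query is a slightly perturbed copy of one of the first ten items.
queries = items[:10] + 0.01 * rng.normal(size=(10, 300))
idx, scores = top_k_neighbours(queries, items, k=3)
print(idx[:, 0])  # best match per query
```

An approximate index trades a little accuracy for speed on larger collections; an exact search like this one can serve as a baseline to check its results against.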

Upvotes: 0

simbamford

Reputation: 71

I think your question is ambiguous: you might mean a single similarity score comparing the average of list 1 with the average of list 2. I'm assuming instead that you want a similarity score for each combination of items from the two lists. For 10K items per list, that produces 10K^2 = 100M similarity scores.

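import spacy

# Note: the 'en' shortcut was removed in spaCy v3; newer versions need a full
# package name such as 'en_core_web_md'.
spacyModel = spacy.load('en')

list1 = ["hello, example 1", "right, second example"]
list2 = ["hello, example 1 in the second list", "And now for something completely different"]

# Parse every text once so each Doc is reused across comparisons.
list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

# One row per list2 item, one column per list1 item.
similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]

print(similarityMatrix)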
[[0.8537950408055295, 0.8852732956832498], [0.5802435148988874, 0.7643245611465626]]
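
The nested list comprehension above makes one Python-level call per pair, which gets slow at 100M pairs. As a rough sketch, assuming each text has already been turned into a fixed-length vector (for example via spaCy's `doc.vector`), the whole matrix can be computed with a single NumPy matrix product. The function name `cosine_similarity_matrix` and the random input vectors are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return a (len(a), len(b)) matrix of cosine similarities between rows."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    # Normalise each row to unit length; the floor guards against zero vectors.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    # A single matrix product replaces the nested Python loop.
    return a_unit @ b_unit.T

# Hypothetical data: random vectors stand in for 300-d document vectors.
rng = np.random.default_rng(0)
vectors1 = rng.normal(size=(1000, 300))
vectors2 = rng.normal(size=(500, 300))
sim = cosine_similarity_matrix(vectors1, vectors2)
print(sim.shape)  # (1000, 500)
```

Here `sim[i, j]` compares item `i` of the first list with item `j` of the second, which is the transpose of the layout produced by the list comprehension. At 10K × 10K in float32 the result takes about 400 MB, so it may be worth computing it in row blocks if memory is tight.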

Upvotes: 1
