Reputation: 935
I have trained an LDA model on a corpus using Gensim. Now that I have the topic distribution for each document, how can I compare how similar two documents are in their topics? I would like to have a summary measure.
For example, the following are the topic distributions of two documents. There are 75 topics in total. For brevity, I show only the 10 topics with the largest probabilities for each document (so the topics are not in numerical order). (40, 0.5523168) means that topic #40 has a probability of 0.5523168 for DOC #1.
Should I calculate the Euclidean or cosine distance between the two vectors? And using this summary measure, can I say that, for example, DOC 1 is more similar to DOC 2 than to DOC 3, or that DOC 1 and DOC 2 are topically more similar to each other than DOC 3 and DOC 4 are? Thank you!
DOC #1:
[(40, 0.5523168), (60, 0.12225048), (43, 0.07556598), (41, 0.065885976),
(22, 0.05838573), (24, 0.044774733), (74, 0.019839266), (65, 0.019544959),
(51, 0.015470431), (36, 0.013449047)]
DOC #2:
[(73, 0.58864516), (41, 0.16827711), (51, 0.09783472), (63, 0.06510383),
(24, 0.04722658), (32, 0.014467965), (44, 0.012267662), (47, 0.0031533625),
(18, 0.0022214972), (0, 1.2154361e-05)]
Upvotes: 4
Views: 4258
Reputation: 691
Gensim Functionality
Gensim provides the similarities.docsim functionality to "compute similarities across a collection of documents in the Vector Space Model." You can see the documentation here; there is also a tutorial here for the similarity queries.
Document Similarity Measures
Using Euclidean distance would be an uncommon choice - you could, but there are potential issues. You could use cosine similarity (link to python tutorial) - this takes the cosine of the angle between two document vectors, which has the advantage of being easily understood. In general it ranges from 1 (the vectors point in the same direction) to -1 (they point in opposite directions), but topic probabilities are never negative, so for LDA topic distributions it stays between 1 (the documents have the same topic mix) and 0 (the documents share no topics at all). And yes, you can compare the cosine similarity of documents 1 & 2 with that of documents 3 & 4, or calculate the similarity of doc1 to doc2 and of doc1 to doc3 and compare them. There is a pretty good tutorial here.
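As a rough illustration of the cosine comparison, here is a minimal pure-Python sketch that works directly on the sparse (topic_id, probability) lists from the question. The helper name cosine_similarity is my own, and because only the top 10 of 75 topics were posted, the missing topics are treated as zero, so the result is an approximation. Gensim's matutils.cossim should do the same job on this kind of sparse input.

```python
import math

# Top-10 topic probabilities copied from the question. The other 65 topics
# were not posted, so they are treated as 0.0 here (an approximation).
doc1 = [(40, 0.5523168), (60, 0.12225048), (43, 0.07556598), (41, 0.065885976),
        (22, 0.05838573), (24, 0.044774733), (74, 0.019839266), (65, 0.019544959),
        (51, 0.015470431), (36, 0.013449047)]
doc2 = [(73, 0.58864516), (41, 0.16827711), (51, 0.09783472), (63, 0.06510383),
        (24, 0.04722658), (32, 0.014467965), (44, 0.012267662), (47, 0.0031533625),
        (18, 0.0022214972), (0, 1.2154361e-05)]


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse (topic_id, probability) lists."""
    a, b = dict(vec_a), dict(vec_b)
    # Only topics present in both documents contribute to the dot product.
    dot = sum(p * b[t] for t, p in a.items() if t in b)
    norm_a = math.sqrt(sum(p * p for p in a.values()))
    norm_b = math.sqrt(sum(p * p for p in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


sim_12 = cosine_similarity(doc1, doc2)
print(f"cosine similarity(doc1, doc2) = {sim_12:.4f}")
```

The two documents overlap only on topics 24, 41 and 51, none of which is dominant in both, so the similarity comes out low. Computing the same value for other document pairs gives numbers you can rank against each other.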
You might also find my answer to this question over at CrossValidated informative, even though your question is somewhat different.
Gensim also has other distance metrics available. These are pretty much all included in gensim's matutils.
Topical distances
You can also measure distances between topics using some of the metrics in gensim's matutils (linked above), such as Hellinger distance.
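Hellinger distance compares two probability distributions, so it also applies to document-topic vectors like the ones in the question. The sketch below is a pure-Python version under that assumption; the function name and the three small distributions are made up for illustration. Gensim's matutils.hellinger should give the same kind of result on (topic_id, probability) lists.

```python
import math


def hellinger(vec_a, vec_b):
    """Hellinger distance between two sparse (topic_id, probability) lists.

    0.0 means identical distributions; 1.0 means no topic is shared
    (assuming each list sums to 1).
    """
    a, b = dict(vec_a), dict(vec_b)
    topics = set(a) | set(b)  # a topic missing from one list counts as 0.0
    squared = sum(
        (math.sqrt(a.get(t, 0.0)) - math.sqrt(b.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(0.5 * squared)


# Hypothetical topic distributions, not taken from the question.
doc_a = [(0, 0.7), (1, 0.2), (2, 0.1)]
doc_b = [(0, 0.6), (1, 0.3), (2, 0.1)]  # close to doc_a
doc_c = [(3, 0.8), (4, 0.2)]            # shares no topics with doc_a

print("a vs b:", hellinger(doc_a, doc_b))
print("a vs c:", hellinger(doc_a, doc_c))
```

Unlike cosine similarity, this is a distance, so smaller values mean more similar documents.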
Upvotes: 6