Charlie Hertz

Reputation: 1

Is there a method in Gensim to find the most relevant topics between two corpuses?

I am trying to better understand how two groups of documents relate to one another. I have already performed similarity scoring on them and would now like to look deeper at how they relate through topic modeling. Rather than just observing the most relevant topics for each document using LDA, is there a method where I could train a model on both groups combined as a single corpus and visualize which topics are most relevant to both groups?

I tried running LDA on the combined corpus, but it returned topics that were clearly split by origin: each topic was relevant to one of the two underlying groups and not the other. Instead, I want to see which smaller topics the two groups overlap on the most.

Upvotes: 0

Views: 271

Answers (2)

rchurch4

Reputation: 899

Gojomo's answer was pretty comprehensive. However, if you want to do it relatively automatically, based on which words are in each topic, consider using Cross-Source Topic Blending (CSTB). CSTB addresses this problem: given several topic sets from different data sets, can you find the overlapping topics? It works by calculating the similarity between topics from each topic set, and it can be used to find topics that overlap in all of the data sets, or in just a few. For instance, by setting source_thresholds to 2, you stipulate that a topic must be shared by any two topic sets. Setting it higher, say to 5 if you have 5 topic sets, is more selective and shows you only topics that are shared by all the topic sets.

You can set the word_threshold parameter to however many words you want topics to share. A higher value makes topics harder to match, but the topics that do match are more closely aligned.

Setting the topn parameter tells CSTB how many words per topic to consider, so that you are matching only the best words in each topic and not ones that hardly belong to a topic.
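
To make the word-overlap idea concrete, here is a minimal sketch of matching topics from two topic sets by the number of top words they share. It is not the CSTB package's implementation or API; the function names (top_words, match_topics) and the toy topics are hypothetical, and only the topn and word_threshold ideas come from the description above. Each topic is assumed to be a list of (word, weight) pairs, which is the shape Gensim's LdaModel.show_topic returns. There is no source_thresholds equivalent because the sketch handles exactly two topic sets.

```python
def top_words(topic, topn):
    """Return the set of the topn highest-weighted words in a topic."""
    ranked = sorted(topic, key=lambda pair: pair[1], reverse=True)
    return {word for word, _ in ranked[:topn]}


def match_topics(topics_a, topics_b, topn=10, word_threshold=3):
    """Pair up topics from two topic sets that share enough top words.

    Returns a list of (index_in_a, index_in_b, shared_words) tuples.
    """
    matches = []
    for i, topic_a in enumerate(topics_a):
        words_a = top_words(topic_a, topn)
        for j, topic_b in enumerate(topics_b):
            shared = words_a & top_words(topic_b, topn)
            if len(shared) >= word_threshold:
                matches.append((i, j, sorted(shared)))
    return matches


# Toy topics, invented purely for illustration.
topics_a = [
    [("price", 0.3), ("market", 0.2), ("stock", 0.1)],
    [("cell", 0.3), ("gene", 0.2), ("protein", 0.1)],
]
topics_b = [
    [("market", 0.25), ("stock", 0.2), ("trade", 0.1)],
    [("game", 0.3), ("team", 0.2), ("score", 0.1)],
]

matches = match_topics(topics_a, topics_b, topn=3, word_threshold=2)
print(matches)
```

With real models you would probably build each topic set with something like [lda.show_topic(k, topn) for k in range(lda.num_topics)], one list per model trained on each data set.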

Upvotes: 0

gojomo

Reputation: 54173

There's no one method for doing this in Gensim. But once you've trained a topic-model (such as LDA) on the combined corpus of all documents, you could do things like:

  • compare any two documents, by comparing their topics
  • tally the top-N topics for all documents in one of the original corpuses, then the top-N topics for all documents in the second original corpus, and contrast those counts
  • treat the original two corpuses as two giant composite documents, calculate the topics of those two synthetic documents, and compare their topics
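
The second and third suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (tally_top_topics, shared_topic_scores, merge_bows) are hypothetical, the per-document topic lists are assumed to be (topic_id, probability) pairs such as those returned by LdaModel.get_document_topics(bow, minimum_probability=0.0) on a model trained on the combined corpus, and scoring a topic by the smaller of its two per-corpus shares is just one possible way to contrast the counts.

```python
from collections import Counter


def tally_top_topics(doc_topics, top_n=3):
    """Count how often each topic appears among a document's top_n topics."""
    counts = Counter()
    for topics in doc_topics:
        ranked = sorted(topics, key=lambda pair: pair[1], reverse=True)
        counts.update(topic_id for topic_id, _ in ranked[:top_n])
    return counts


def shared_topic_scores(counts_a, counts_b, n_docs_a, n_docs_b):
    """Score topics present in both tallies by the smaller per-corpus share."""
    scores = {
        topic_id: min(counts_a[topic_id] / n_docs_a, counts_b[topic_id] / n_docs_b)
        for topic_id in counts_a.keys() & counts_b.keys()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def merge_bows(bows):
    """Sum bag-of-words vectors into one composite document."""
    total = Counter()
    for bow in bows:
        for token_id, count in bow:
            total[token_id] += count
    return sorted(total.items())


# Toy per-document topic distributions, invented purely for illustration.
doc_topics_a = [[(0, 0.7), (1, 0.2), (2, 0.1)], [(0, 0.6), (2, 0.3), (1, 0.1)]]
doc_topics_b = [[(3, 0.8), (2, 0.15), (0, 0.05)], [(3, 0.5), (2, 0.4), (1, 0.1)]]

counts_a = tally_top_topics(doc_topics_a, top_n=2)
counts_b = tally_top_topics(doc_topics_b, top_n=2)
overlap = shared_topic_scores(counts_a, counts_b, len(doc_topics_a), len(doc_topics_b))
print(overlap)

# Composite document: merge every bag-of-words vector from one corpus.
composite = merge_bows([[(0, 1), (2, 3)], [(0, 2), (1, 1)]])
print(composite)
```

For the composite-document comparison, you would pass the merged vector for each original corpus to the trained model (for example lda.get_document_topics(composite, minimum_probability=0.0)) and compare the two resulting topic distributions.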

Upvotes: 0
