Nils_Denter

Reputation: 498

Dynamic Topic Modeling with Gensim / which code?

I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents. Does anybody have experience using DTM in the gensim package? I have identified two models:

  1. models.ldaseqmodel – Dynamic Topic Modeling in Python Link
  2. models.wrappers.dtmmodel – Dynamic Topic Models (DTM) Link

Which one did you use, or if you used both, which one is "better"? In other words, which one did/do you prefer?

Upvotes: 5

Views: 4544

Answers (1)

jhl

Reputation: 691

Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.

Are the model outputs identical?

Not exactly. They are, however, very close to identical (98%+) - I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I've not come across a case where a difference in the sixth or seventh digit after the decimal point had any significant meaning. Interpreting the topics your model finds matters much more than one version finding a higher topic loading for some word by 0.00002.

The big difference between the two models: dtmmodel is a python wrapper for the original C++ implementation from blei-lab, which means python will run the binaries, while ldaseqmodel is fully written in python.
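To make the practical difference concrete, here is a rough sketch of how each model is built from the same data. The gensim calls are left as comments so the sketch runs without gensim installed; the signatures are from gensim 3.x (where `models.wrappers` still existed) and may differ in newer versions, and the binary path is a placeholder. Note the small naming difference the two APIs have: `time_slice` vs `time_slices`.

```python
# A tiny bag-of-words corpus: each document is a list of (word_id, count).
corpus = [
    [(0, 2), (1, 1)],   # doc 0, period 1
    [(0, 1), (2, 3)],   # doc 1, period 1
    [(1, 2), (2, 1)],   # doc 2, period 2
]
id2word = {0: "claim", 1: "method", 2: "device"}

# Both models take the corpus plus a list of time-slice sizes:
# the i-th entry is the number of documents in period i,
# and the sizes must sum to the number of documents.
time_slice = [2, 1]
assert sum(time_slice) == len(corpus)

# Pure-python implementation:
# from gensim.models import LdaSeqModel
# model = LdaSeqModel(corpus=corpus, id2word=id2word,
#                     time_slice=time_slice, num_topics=2)

# Wrapper around the blei-lab C++ binary (needs the compiled executable):
# from gensim.models.wrappers import DtmModel
# model = DtmModel("/path/to/dtm-binary",  # placeholder path
#                  corpus=corpus, id2word=id2word,
#                  time_slices=time_slice, num_topics=2)
```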

Why use dtmmodel?

  • the C++ code is faster than the python implementation
  • supports the Document Influence Model from Gerrish/Blei 2010 (potentially interesting for your research; see this paper for an implementation)

Why use ldaseqmodel?

  • easier to install (simple import statement vs downloading binaries)
  • can use sstats from a pretrained LDA model - useful with LdaMulticore
  • easier to understand the workings of the code
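The sstats point works roughly like this (a sketch, with the gensim calls commented out so it runs without gensim; parameter names are from gensim 3.x, where `LdaSeqModel` accepts an `initialize` argument together with either a fitted model or its sufficient statistics):

```python
num_topics = 2

# 1) Fit a fast, parallel static LDA model first:
# from gensim.models import LdaMulticore
# lda = LdaMulticore(corpus=corpus, id2word=id2word,
#                    num_topics=num_topics, workers=4)

# 2) Hand its state over when building the dynamic model, instead of
#    letting LdaSeqModel run its own single-core initialization:
# from gensim.models import LdaSeqModel
# model = LdaSeqModel(corpus=corpus, id2word=id2word,
#                     time_slice=time_slice, num_topics=num_topics,
#                     initialize='ldamodel', lda_model=lda)

# Alternatively, pass the sufficient statistics directly:
# model = LdaSeqModel(corpus=corpus, id2word=id2word,
#                     time_slice=time_slice, num_topics=num_topics,
#                     initialize='own', sstats=lda.state.sstats)
```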

I mostly use ldaseqmodel, but that's for convenience. Native DIM support would be great to have, though.

What should you do?

Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.

A full analysis might take hours anyway; if you let your code run overnight, there is little practical difference - after all, do you care whether it finishes at 3am or 5am? If runtime is critical, I would assume dtmmodel will be more useful.
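Carving a trial sample out of a time-sliced corpus needs a little care, because the per-period slice sizes have to stay in sync with the documents you keep. A minimal sketch (the helper name and toy numbers are illustrative, not from gensim):

```python
def sample_slices(corpus, time_slice, docs_per_slice):
    """Keep at most docs_per_slice documents from each time period,
    returning the sampled corpus and the matching slice sizes."""
    sample, sampled_sizes = [], []
    start = 0
    for size in time_slice:
        take = min(size, docs_per_slice)
        sample.extend(corpus[start:start + take])
        sampled_sizes.append(take)
        start += size
    return sample, sampled_sizes

# Toy stand-in: 3800 "documents" spread over four periods.
corpus = [[(0, 1)]] * 3800
time_slice = [900, 1000, 950, 950]

small_corpus, small_slices = sample_slices(corpus, time_slice, 100)
assert sum(small_slices) == len(small_corpus)  # invariant both models require
print(small_slices)  # -> [100, 100, 100, 100]
```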

For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel

Upvotes: 4
