Reputation: 498
I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents. Does anybody have experience using DTM in the gensim package? I identified two models: ldaseqmodel and dtmmodel.
Which one did you use, or if you used both, which one is "better"? In other words, which one did/do you prefer?
Upvotes: 5
Views: 4544
Reputation: 691
Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.
Are the model outputs identical?
Not exactly. They are, however, very, very close to being identical (98%+). I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I have not come across a case where a difference in the sixth or seventh digit after the decimal point has any significant meaning. Interpreting the topics your model finds matters much more than one version finding a higher topic loading for some word by 0.00002.
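If you want to check how close the two outputs are on your own data, a plain comparison of word probabilities is enough. This is a minimal sketch, assuming you have already pulled one topic at one time slice from each model into a word -> probability dict, and that the two topics really correspond to each other (topic numbering need not match between runs). `max_abs_diff` is a made-up helper, not a gensim function, and the numbers are illustrative only.

```python
def max_abs_diff(topic_a, topic_b):
    """Largest absolute probability gap over the words both topics share."""
    shared = set(topic_a) & set(topic_b)
    if not shared:
        raise ValueError("the two topics have no words in common")
    return max(abs(topic_a[word] - topic_b[word]) for word in shared)


# Illustrative numbers only, not real model output.
from_dtmmodel = {"battery": 0.04120, "cell": 0.03310, "anode": 0.01200}
from_ldaseqmodel = {"battery": 0.04122, "cell": 0.03309, "electrode": 0.01100}

print(max_abs_diff(from_dtmmodel, from_ldaseqmodel))
```

Words that appear in only one of the two top-word lists are ignored here; whether that is acceptable depends on how many top words you extract.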
The big difference between the two models: dtmmodel is a Python wrapper for the original C++ implementation from blei-lab, which means Python will run the binaries, while ldaseqmodel is fully written in Python.
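To make that concrete, here is a rough sketch of how each model is typically constructed. It assumes `corpus` is a bag-of-words corpus sorted chronologically and `id2word` is a gensim Dictionary. `build_time_slices`, `fit_ldaseq` and `fit_dtm_wrapper` are hypothetical helper names, and `dtm_path` stands for wherever you put the compiled DTM binary. As far as I know, the `gensim.models.wrappers` package was removed in gensim 4.0, so the wrapper part only applies to older versions.

```python
from collections import Counter


def build_time_slices(years):
    """Return the sorted distinct years and the number of documents in each.

    Both models expect one document count per time slice, with the corpus
    itself ordered the same way.
    """
    counts = Counter(years)
    ordered = sorted(counts)
    return ordered, [counts[year] for year in ordered]


def fit_ldaseq(corpus, id2word, time_slice, num_topics=10):
    # Pure-Python implementation: only needs the import below.
    from gensim.models.ldaseqmodel import LdaSeqModel

    return LdaSeqModel(corpus=corpus, id2word=id2word,
                       time_slice=time_slice, num_topics=num_topics)


def fit_dtm_wrapper(dtm_path, corpus, id2word, time_slices, num_topics=10):
    # Wrapper around the external binary (assumed: gensim < 4.0).
    from gensim.models.wrappers import DtmModel

    return DtmModel(dtm_path, corpus=corpus, time_slices=time_slices,
                    id2word=id2word, num_topics=num_topics)


# One publication year per document (example values).
ordered_years, time_slice = build_time_slices([2001, 2000, 2001, 2002])
print(ordered_years, time_slice)
```

Note the small naming difference: the argument is `time_slice` in one model and `time_slices` in the other.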
Why use dtmmodel?
Why use ldaseqmodel?
- Easier to set up (import statement vs downloading binaries)
- Can use sstats from a pretrained LDA model - useful with LdaMulticore
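The sstats point can be sketched roughly like this. It assumes `LdaSeqModel` accepts `initialize="own"` together with an `sstats` matrix of shape (vocabulary size, number of topics), while a trained LDA model keeps its sufficient statistics as (number of topics, vocabulary size), hence the transpose. That is my understanding of the gensim API, so check it against your version. `ldaseq_init_kwargs` and `fit_ldaseq_from_lda` are hypothetical helper names.

```python
def ldaseq_init_kwargs(lda_model):
    """Keyword arguments that seed LdaSeqModel from an already trained LDA model."""
    return {
        "initialize": "own",
        # Transpose from (num_topics, vocab_size) to (vocab_size, num_topics).
        "sstats": lda_model.state.sstats.T,
        "num_topics": lda_model.num_topics,
    }


def fit_ldaseq_from_lda(corpus, id2word, time_slice, num_topics=10, workers=3):
    from gensim.models import LdaMulticore
    from gensim.models.ldaseqmodel import LdaSeqModel

    # Train the static LDA model in parallel first ...
    lda = LdaMulticore(corpus=corpus, id2word=id2word,
                       num_topics=num_topics, workers=workers)
    # ... then reuse its sufficient statistics to initialise the dynamic model.
    return LdaSeqModel(corpus=corpus, id2word=id2word, time_slice=time_slice,
                       **ldaseq_init_kwargs(lda))
```

As far as I know, there is also an `initialize="ldamodel"` option that takes the trained model directly through `lda_model=`.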
I mostly use ldaseqmodel, but that's for convenience. Native DIM (Document Influence Model) support would be great to have, though.
What should you do?
Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.
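One way to build such a small sample set is to draw a few documents from every time slice, so the sample still covers the whole period. This is a minimal sketch in plain Python, assuming you have one publication year per document. `sample_per_slice` is a hypothetical helper, and the sampled documents would still have to go through your usual preprocessing and dictionary/bag-of-words step before being passed to either model.

```python
import random
from collections import defaultdict


def sample_per_slice(docs, years, per_slice, seed=0):
    """Pick up to `per_slice` documents from each year, in chronological order.

    Returns the sampled documents and the matching document count per slice.
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for doc, year in zip(docs, years):
        by_year[year].append(doc)

    sample, time_slice = [], []
    for year in sorted(by_year):
        group = by_year[year]
        chosen = rng.sample(group, min(per_slice, len(group)))
        sample.extend(chosen)
        time_slice.append(len(chosen))
    return sample, time_slice


# Example with placeholder document names.
docs = ["a", "b", "c", "d", "e"]
years = [2001, 2000, 2001, 2001, 2000]
sample, time_slice = sample_per_slice(docs, years, per_slice=2)
print(sample, time_slice)
```

Fixing the seed keeps the sample identical across runs, so both models see exactly the same documents.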
A full analysis might take hours anyway. If you let your code run overnight, there is little practical difference; after all, do you care if it finishes at 3am or 5am? If runtime is critical, I would assume dtmmodel will be more useful.
For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel
Upvotes: 4