Reputation: 31
I am attempting to run an LDA topic model using Mallet. My corpus consists of user comments from news websites. It is a relatively small corpus of approximately 614k words.
The first approach I took was to split these words across just a dozen very large documents. I got results, but they were not very good, which in retrospect I believe was to be expected. I have since increased the number of documents by chunking the corpus into 614 documents of roughly 1k words each. With this approach I find it much easier to identify topics (and more of them) via the top words; they simply seem more topically coherent.

On the other hand, however, I now get much "worse" coherence and exclusivity scores. With just a dozen documents, coherence was somewhere between -12 (best) and -200 (worst). With the hundreds of documents, coherence values spread from -120 (best) to -1000 (worst). I used the same parameters for both approaches (1000 Gibbs sampling iterations, optimization interval set to 10).
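For reference, the chunking step I used is essentially the following (a minimal Python sketch; the toy corpus and the 1000-word chunk size are stand-ins, not my actual data or pipeline):

```python
# Split a corpus (a flat list of word tokens) into fixed-size chunks,
# one chunk per pseudo-document for Mallet import.

def chunk_words(words, chunk_size=1000):
    """Yield successive lists of up to `chunk_size` words from `words`."""
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]

# Toy example: 2500 "words" yield 3 chunks of sizes 1000, 1000, 500.
corpus = [f"w{i}" for i in range(2500)]
chunks = list(chunk_words(corpus))
print(len(chunks))                    # → 3
print(len(chunks[0]), len(chunks[-1]))  # → 1000 500
```

Each chunk is then written to its own text file and imported with `bin/mallet import-dir` before training with `bin/mallet train-topics --num-iterations 1000 --optimize-interval 10`.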
I guess my questions are:
Many thanks in advance!
Upvotes: 0
Views: 50