Glorifier

Reputation: 31

Understanding and improving coherence values using Mallet

I am attempting to run an LDA topic model using Mallet. My corpus consists of user comments from news websites. It's a relatively small corpus with approx. 614k words.

The first approach I took was to split these words across just a dozen very large documents. I got results, but they were not very good, which in retrospect I believe was to be expected. I have now increased the number of documents by chunking the corpus into 614 equal documents of 1k words each. With this approach I find it much easier to identify topics (and more of them) via the top words; they simply seem more topically coherent. On the other hand, however, I now get much "worse" coherence and exclusivity scores. With just a dozen documents, coherence was somewhere between -12 (best) and -200 (worst). With the hundreds of documents, coherence values now spread from -120 (best) to -1000 (worst). I used the same parameters for both approaches (1000 Gibbs sampling iterations and an optimization interval of 10).
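For context, the chunking step I used looks roughly like this (the token list is a stand-in; in reality the corpus is tokenized comment text):

```python
# Split a tokenized corpus into fixed-size pseudo-documents of 1,000 words
# each, so a ~614k-word corpus yields ~614 documents to import into Mallet.
def chunk_corpus(words, chunk_size=1000):
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

words = ["tok"] * 614_000          # placeholder for the real token stream
docs = chunk_corpus(words)
print(len(docs))                   # number of 1k-word pseudo-documents
```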

I guess my questions are:

  1. Is it correct to assume that the number of documents has an impact on the average coherence value or am I missing something else?
  2. Is there a rule of thumb of some sort that would help me define what a good coherence score is? I am guessing there isn't but just wanted to check if anyone has an idea.
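To make question 1 concrete: my understanding is that Mallet's diagnostic coherence follows the UMass measure of Mimno et al. (2011), which is built from document frequencies of a topic's top words. A toy re-implementation (not Mallet's actual code) shows that re-chunking the very same corpus changes those document frequencies, and with them the score:

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style topic coherence (after Mimno et al. 2011):
    sum over ranked word pairs of log((co-doc-freq + 1) / doc-freq)."""
    doc_sets = [set(d) for d in documents]
    score = 0.0
    for i, wi in enumerate(top_words):
        for wj in top_words[:i]:                       # higher-ranked word wj
            d_wj = sum(wj in s for s in doc_sets)      # doc freq of wj
            d_wi_wj = sum(wi in s and wj in s for s in doc_sets)  # co-doc freq
            if d_wj:
                score += math.log((d_wi_wj + 1) / d_wj)
    return score

# Toy illustration: the identical token stream, chunked coarsely vs. finely,
# yields different document (co-)frequencies and thus a different score.
text = ["cat", "dog", "fish", "cat", "dog", "bird"] * 50   # 300 tokens
few_big = [text[i:i + 100] for i in range(0, len(text), 100)]   # 3 docs
many_small = [text[i:i + 10] for i in range(0, len(text), 10)]  # 30 docs
top = ["cat", "dog", "fish"]
print(umass_coherence(top, few_big), umass_coherence(top, many_small))
```

So the absolute value depends on the document granularity, which is why scores from the two runs are not directly comparable.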

Many thanks in advance!

Upvotes: 0

Views: 50

Answers (0)
