empty topics in Mallet LDA topic modeling

Question

When I run Mallet LDA with a large number of topics (e.g. T > 300), I get topics with no topic words at all (not a single word is listed).

Why is that happening? Is this a bug in Mallet?

I'm using Mallet 2.0.7 on an Ubuntu 14.04 machine.

EDIT

mallet-2.0.7/bin/mallet import-dir --input $path/$posts --output $outputDir/$posts.mallet \
    --keep-sequence --remove-stopwords --token-regex "[\p{Alpha}_]+"  #--save-text-in-source

mallet-2.0.7/bin/mallet train-topics --input $outputDir/$posts.mallet \
    --num-topics $topics --output-state $outputDir/topic-state.gz \
    --output-topic-keys $outputDir/topics.txt --output-doc-topics $outputDir/document_composition.txt \
    --topic-word-weights-file $outputDir/topic_word_weights.txt --num-top-words $numtopwords \
    --optimize-interval 10 --word-topic-counts-file $outputDir/topic_counts.txt

As for the corpus details: it contains about 1000 files, each of which may contain one or a few sentences. The corpus is pretty small, about 1 MB in size.

samsamara · Accepted Answer

Answer I got from David Mimno:

This usually indicates that you have a large number of topics relative to the size of the corpus. Mallet uses Gibbs sampling, so the topics are based on actual counts of tokens currently assigned to a topic. There's nothing wrong with these "empty" topics per se, as long as you know not to put too much trust in them.
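
To see which topics ended up empty, one option is to tally the token assignments in the saved Gibbs state. The sketch below is a rough check, not part of Mallet: `empty_topics` is a hypothetical helper name, and it assumes each non-comment line of the --output-state file ends with the topic id (the column layout I would expect is `doc source pos typeindex type topic`). Check the header line of your own state file before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list topic ids that have no tokens assigned
# in a gzipped Mallet state file.
# Assumption: every non-comment line ends with the topic id
# (columns: doc source pos typeindex type topic).
empty_topics() {
    local state_file="$1" num_topics="$2"
    gzip -dc "$state_file" | awk -v T="$num_topics" '
        !/^#/ && NF > 0 { count[$NF]++ }     # tally tokens per topic id
        END {
            for (t = 0; t < T; t++)
                if (!(t in count)) print t   # topic received zero tokens
        }'
}

# Example call, using the variables from the question:
# empty_topics "$outputDir/topic-state.gz" "$topics"
```

Each printed id is a topic with no tokens in the final sampling state. If the list gets longer as --num-topics goes up, that is consistent with the explanation above: the corpus probably does not have enough tokens to populate that many topics.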
