samsamara

Reputation: 4750

Empty topics in Mallet LDA topic modeling

When I run Mallet LDA with a large number of topics (e.g. T > 300), I get topics with no topic words at all (not a single word).

Why is that happening? Is this a bug in Mallet?

I'm using Mallet 2.0.7 on an Ubuntu 14.04 machine.

EDIT

mallet-2.0.7/bin/mallet import-dir --input $path/$posts --output $outputDir/$posts.mallet \
        --keep-sequence --remove-stopwords --token-regex "[\\p{Alpha}_]+"  #--save-text-in-source

mallet-2.0.7/bin/mallet train-topics --input $outputDir/$posts.mallet \
        --num-topics $topics --output-state $outputDir/topic-state.gz \
        --output-topic-keys $outputDir/topics.txt --output-doc-topics $outputDir/document_composition.txt \
        --topic-word-weights-file $outputDir/topic_word_weights.txt --num-top-words $numtopwords \
        --optimize-interval 10 --word-topic-counts-file $outputDir/topic_counts.txt

As for the corpus, it contains about 1000 files, each with one or a few sentences. The corpus is pretty small, about 1 MB in size.

Upvotes: 1

Views: 653

Answers (1)

samsamara

Reputation: 4750

Answer I got from David Mimno:

This usually indicates that you have a large number of topics relative to the size of the corpus. Mallet uses Gibbs sampling, so the topics are based on actual counts of tokens currently assigned to a topic. There's nothing wrong with these "empty" topics per se, as long as you know not to put too much trust in them.
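To see which topics ended up with no tokens, one option is to count assignments in the file written by --output-state. The snippet below is only a rough sketch: empty_topics is a hypothetical helper (not part of Mallet), and it assumes each non-comment line of the state file ends with the topic id, with the header and hyperparameter lines starting with "#".

```shell
# Hypothetical helper: print the ids of topics with zero assigned tokens.
# Reads an uncompressed Mallet state file on stdin; $1 is the number of topics.
# Assumption: the last field of every non-comment line is the topic id.
empty_topics() {
    awk -v k="$1" '
        /^#/ { next }             # skip header, #alpha and #beta lines
        NF   { count[$NF]++ }     # tally tokens per topic id
        END  {
            for (t = 0; t < k; t++)
                if (!(t in count)) print t
        }'
}

# Usage with the variables from the question:
# gunzip -c "$outputDir/topic-state.gz" | empty_topics "$topics"
```

If most of the ids come back when T is large, that fits the explanation above: the corpus simply doesn't have enough tokens to populate that many topics, so lowering --num-topics is probably the first thing to try.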

Upvotes: 4
