schimo

Reputation: 93

R topicmodels LDA

I am running LDA on a small corpus of 2 documents (sentences) for testing purposes. The following code returns topic-term and document-topic distributions that are not at all reasonable given the input documents, whereas running exactly the same example in Python returns reasonable results. Does anyone know what is wrong here?

library(topicmodels)
library(tm)

## two toy documents, each repeating a single term
d1 <- "bank bank bank"
d2 <- "stock stock stock"

corpus <- Corpus(VectorSource(c(d1, d2)))

## fit LDA to the data
dtm <- DocumentTermMatrix(corpus)
ldafit <- LDA(dtm, k = 2, method = "Gibbs")

## get the posterior topic-term and document-topic distributions
topicTerm <- t(posterior(ldafit)$terms)
docTopic <- posterior(ldafit)$topics
topicTerm
docTopic

> topicTerm
              1         2
bank  0.3114525 0.6885475
stock 0.6885475 0.3114525
> docTopic
          1         2
1 0.4963245 0.5036755
2 0.5036755 0.4963245

The results from Python are as follows:

>>> docTopic
array([[ 0.87100799,  0.12899201],
       [ 0.12916713,  0.87083287]])
>>> fit.print_topic(1)
u'0.821*"bank" + 0.179*"stock"'
>>> fit.print_topic(0)
u'0.824*"stock" + 0.176*"bank"'

Upvotes: 2

Views: 1344

Answers (2)

schimo

Reputation: 93

The author of the R package topicmodels, Bettina Grün, pointed out that this is due to the selection of the hyperparameter 'alpha'.

LDA in R uses alpha = 50/k = 25 by default, while LDA in gensim (Python) defaults to alpha = 1/k = 0.5. A smaller alpha favors sparse document-topic distributions, i.e. each document contains a mixture of just a few topics. Hence, decreasing alpha in the R LDA yields very reasonable results:

ldafit <- LDA(dtm, k = 2, method = "Gibbs", control = list(alpha = 0.5))

posterior(ldafit)$topics
#    1     2
# 1  0.125 0.875
# 2  0.875 0.125

posterior(ldafit)$terms
#   bank    stock
# 1 0.03125 0.96875
# 2 0.96875 0.03125
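You can also check which alpha a fitted model actually used by inspecting its alpha slot (a minimal check; the slot belongs to the S4 model class that LDA returns):

ldafit@alpha
## 0.5 for the fit above; 25 (= 50/k) for the original call without the control argument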

Upvotes: 3

Nelly Kong

Reputation: 289

Try plotting the perplexity over iterations and make sure it converges. The initial state also matters. (Both the document size and the sample size seem small here, though.)
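For example, topicmodels does not expose per-iteration perplexity directly, but its Gibbs sampler can record the log-likelihood during sampling via the keep control option, which serves the same purpose. A sketch building on the question's dtm; the seed value 123 is an arbitrary choice to pin down the initial state:

## record the log-likelihood at every iteration (keep = 1) and fix the
## seed so the initial state of the chain is reproducible
ldafit <- LDA(dtm, k = 2, method = "Gibbs",
              control = list(alpha = 0.5, seed = 123, iter = 2000, keep = 1))

## the recorded values are stored in the logLiks slot; a curve that
## flattens out toward the end suggests the chain has converged
plot(ldafit@logLiks, type = "l", xlab = "Iteration", ylab = "Log-likelihood")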

Upvotes: 0
