Reputation: 11
Thanks in advance for your help. I have run a dataset through the LDA function in Jonathan Chang's 'lda' package (n.b. this is different from the 'topicmodels' package). Below is a reproducible example, which uses the cora dataset that comes automatically when you install and load the 'lda' package.
library(lda)
data(cora.documents) #list of words contained in each of the 2,410 documents
data(cora.vocab) #vocabulary list of words that occur at least once across all documents
Thereafter, I set the parameters and run the collapsed Gibbs sampler to fit the actual LDA model.
#parameters for LDA algorithm
K <- 20 #number of topics to be modelled from the corpus, "K"
G <- 1000 #number of Gibbs sampling iterations - the higher this number, the more likely the sampler is to converge, "G"
alpha <- 0.1 #Dirichlet prior on the document-topic distributions, "alpha"
beta <- 0.1 #Dirichlet prior on the topic-term distributions, "beta"/"eta"
#fits an LDA model based on the above parameters
lda_fit <- lda.collapsed.gibbs.sampler(cora.documents, K = K, cora.vocab,
num.iterations = G, alpha = alpha, eta = beta)
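For reference, a quick way to inspect what the sampler returned (a sketch; document_sums and topics are the components documented for lda.collapsed.gibbs.sampler, stored as topic-by-document and topic-by-word count matrices):
#inspect the components returned by the collapsed Gibbs sampler
names(lda_fit) #should include "document_sums" and "topics"
dim(lda_fit$document_sums) #K rows (topics) x number of documents (columns)
dim(lda_fit$topics) #K rows (topics) x vocabulary size (columns)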
Following this, we examine one component of the LDA output, called document_sums. This component gives, for each individual document, the number of words allocated to each of the 20 topics (based on the K value I chose). For instance, one document may have 4 words allocated to Topic 3 and 12 words allocated to Topic 19, in which case the document would be assigned to Topic 19.
#gives raw figures for no. of times each document (column) had words allocated to each of 20 topics (rows)
document_sums <- as.data.frame(lda_fit$document_sums)
document_sums[1:20, 1:20]
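To make that hard assignment concrete, here is a small sketch of my own: since documents are the columns of document_sums, taking which.max over each column picks the dominant topic per document.
#hard assignment: for each document (column), pick the topic (row) with the most words
hard_topics <- apply(lda_fit$document_sums, 2, which.max)
head(hard_topics) #dominant topic for the first few documents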
However, what I want to do is essentially use the principle of fuzzy membership. Instead of allocating each document to the topic in which it has the most words, I want to extract the probability that each document belongs to each topic. document_sums is quite close to this, but I still have to do some processing on the raw counts.
Jonathan Chang, the creator of the 'lda' package, says as much in this thread:
n.b. If you want to convert the matrix to probabilities just row normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.
Separately, another reply on another forum reaffirms this:
The resulting document_sums will give you the (unnormalized) distribution over topics for the test documents. Normalize them, and compute the inner product, weighted by the RTM coefficients to get the predicted link probability (or use predictive.link.probability)
And thus, my question is: how do I normalise my document_sums and 'add the smoothing constant'? I am unsure how to do either.
Upvotes: 1
Views: 213
Reputation: 390
As asked: you need to add the prior to the matrix of counts and then divide each document's row by its total. Note that document_sums from the 'lda' package has topics in rows and documents in columns, so transpose it first so that documents are the rows. For example
theta <- t(lda_fit$document_sums) + alpha #documents in rows, topics in columns
theta <- theta / rowSums(theta) #each row (one document) now sums to 1
You'll need to do something similar for the matrix of counts relating words to topics; see the sketch below.
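A minimal sketch of that step, assuming lda_fit$topics is the K x V matrix of word counts per topic returned by lda.collapsed.gibbs.sampler:
#topic-word probabilities: add the topic-term prior, then divide each row by its total
phi <- lda_fit$topics + beta #K rows (topics) x vocabulary size (columns)
phi <- phi / rowSums(phi) #each row (one topic) now sums to 1 over the vocabulary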
However, if you're using LDA, may I suggest you check out textmineR? It does this normalization (and other useful things) for you. I originally wrote it as a wrapper for the 'lda' package, but have since implemented my own Gibbs sampler to enable other features. Details on using it for topic modeling are in the third vignette.
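As a rough illustration only (treat the function names, arguments, and toy data below as assumptions and check the vignette for the actual interface), fitting a model with textmineR on a document-term matrix looks roughly like this:
library(textmineR)
#hypothetical toy text; cora ships in the 'lda' list format, so for this route you
#would need your documents as plain text (or an existing sparse document-term matrix)
docs <- c("language model topic inference", "neural network training data")
#build a document-term matrix and fit LDA; the fitted model exposes theta
#(documents x topics) and phi (topics x words) already normalized as probabilities
dtm <- CreateDtm(doc_vec = docs, doc_names = paste0("doc_", seq_along(docs)))
model <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)
head(model$theta) #per-document topic probabilities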
Upvotes: 1