Reputation: 159
I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post, http://goo.gl/ccPvE, I was able to develop the intuition behind LDA. However, I still don't fully understand the various calculations that go into it. Could someone show me the calculations using a very small corpus (let's say 3-5 sentences and 2-3 topics)?
Upvotes: 6
Views: 6656
Reputation: 1632
LDA Procedure
Step 1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand).
Step 2: This random assignment gives topic representations of all the documents and word distributions of all the topics, albeit not very good ones.
So, to improve upon them: for each document d, go through each word w and compute:
p(topic t | document d): proportion of words in document d that are assigned to topic t
p(word w | topic t): proportion of assignments to topic t, over all documents d, that come from word w
Step 3: Reassign word w a new topic t’, where we choose topic t’ with probability p(topic t’ | document d) * p(word w | topic t’).
This product is the generative model's estimate of the probability that topic t’ generated word w. We iterate this last step multiple times for each document in the corpus until the assignments reach a steady state.
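The procedure above can be sketched in Python roughly as follows. This is only a minimal sketch of the simplified sampler described in the steps, not a reference implementation: the function name `lda_sketch`, its parameters, and the small smoothing constant `eps` (added so that no topic ever ends up with probability zero) are assumptions and not part of the description above.

```python
import random
from collections import Counter

def lda_sketch(docs, num_topics, iterations=50, eps=0.01, seed=0):
    """Simplified LDA sampler following Steps 1-3 (a sketch, not a library API)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Step 2: the counts implied by that random assignment.
    doc_topic = [Counter(a) for a in assignments]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]    # word counts per topic
    topic_total = [0] * num_topics                         # total assignments per topic
    for doc, assigned in zip(docs, assignments):
        for w, t in zip(doc, assigned):
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word's current assignment out of the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Step 3: weight each topic by p(t | d) * p(w | t).
                weights = []
                for t in range(num_topics):
                    p_t_given_d = (doc_topic[d][t] + eps) / (len(doc) - 1 + num_topics * eps)
                    p_w_given_t = (topic_word[t][w] + eps) / (topic_total[t] + vocab_size * eps)
                    weights.append(p_t_given_d * p_w_given_t)

                # Draw the new topic in proportion to those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return assignments, doc_topic, topic_word
```

Calling `lda_sketch([["bank", "called", "money"], ["bank", "said", "money", "approved"]], num_topics=2)` returns one topic label per word plus the two count tables. Full LDA implementations use Dirichlet priors where this sketch uses the single constant `eps`.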
Solved calculation
Let's say you have two documents.
Doc i: “The bank called about the money.”
Doc ii: “The bank said the money was approved.”
After removing stop words, capitalization, and punctuation, the unique words in the corpus are:
bank called about money boat approved
Next, we randomly assign each word to a topic.
Then we randomly select a word from Doc i (the word "bank", currently assigned to topic 1), remove its topic assignment, and calculate the probabilities for its new assignment.
Now we calculate the product of those two probabilities, p(topic t | document d) * p(word w | topic t), for each topic.
Topic 2 is a better fit than topic 1 for both the document and the word (its area, i.e. the product, is greater). So our new assignment for the word "bank" will be topic 2.
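To make the arithmetic of this single step concrete, here is a small sketch. The topic labels attached to each word are hypothetical (made up for illustration, since the actual assignment is not given in the text), and it assumes that only "the" and "was" are treated as stop words. It picks the topic with the larger product, as done in the step above; the procedure as stated in Step 3 would instead sample in proportion to the products.

```python
from fractions import Fraction

# Tokens after stop-word removal; the topic labels are hypothetical.
docs = {
    "i":  [("bank", 1), ("called", 2), ("about", 2), ("money", 1)],
    "ii": [("bank", 2), ("said", 1), ("money", 2), ("approved", 1)],
}

target_doc, target_pos = "i", 0          # the word "bank" in Doc i
word = docs[target_doc][target_pos][0]

# Every (doc, word, topic) assignment except the one being reassigned.
others = [
    (name, w, t)
    for name, tokens in docs.items()
    for pos, (w, t) in enumerate(tokens)
    if (name, pos) != (target_doc, target_pos)
]

scores = {}
for topic in (1, 2):
    in_doc = [o for o in others if o[0] == target_doc]
    in_topic = [o for o in others if o[2] == topic]
    # p(topic t | document d): share of the document's remaining words on this topic
    p_topic_given_doc = Fraction(sum(o[2] == topic for o in in_doc), len(in_doc))
    # p(word w | topic t): share of this topic's assignments that are this word
    p_word_given_topic = Fraction(sum(o[1] == word for o in in_topic), len(in_topic))
    scores[topic] = p_topic_given_doc * p_word_given_topic

new_topic = max(scores, key=scores.get)
print(scores, new_topic)
```

With these made-up labels, topic 1 scores (1/3) * (0/3) = 0 and topic 2 scores (2/3) * (1/4) = 1/6, so "bank" moves to topic 2. The zero arises because the corpus is tiny and no smoothing is applied here.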
Now we update the counts to reflect the new assignment.
Then we repeat the same reassignment step, iterating through each word of the whole corpus.
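The count update itself is just bookkeeping. A minimal sketch, assuming hypothetical starting counts (the numbers and the helper name `reassign` are made up for illustration and are not taken from the original tables):

```python
from collections import Counter

# Hypothetical counts before the update (topic labels are made up).
doc_i_topics = Counter({1: 2, 2: 2})   # topic counts within Doc i
topic_words = {
    1: Counter({"bank": 1, "money": 1, "said": 1, "approved": 1}),
    2: Counter({"called": 1, "about": 1, "bank": 1, "money": 1}),
}

def reassign(word, old_topic, new_topic, doc_topics, topic_words):
    """Move one occurrence of `word` from old_topic to new_topic in both count tables."""
    doc_topics[old_topic] -= 1
    doc_topics[new_topic] += 1
    topic_words[old_topic][word] -= 1
    topic_words[new_topic][word] += 1

# "bank" in Doc i moves from topic 1 to topic 2.
reassign("bank", old_topic=1, new_topic=2,
         doc_topics=doc_i_topics, topic_words=topic_words)
```

After the call, Doc i has one word on topic 1 and three on topic 2, and both occurrences of "bank" are counted under topic 2. The next word's probabilities are then computed from these updated counts.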
Upvotes: 1
Reputation: 8172
Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:
Then he does some "calculations"
And take guesses of the topics:
Your question is how did he come up with those numbers? Which words in these sentences carry "information":
Now let's go sentence by sentence getting words from each topic:
So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" towards food.
We made two calculations in our heads:
Upvotes: 7