user737128

Reputation: 159

Latent Dirichlet Allocation Solution Example

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However, I still don't have a complete understanding of the various calculations that go on inside it. Could someone show me the calculations using a very small corpus (say 3-5 sentences and 2-3 topics)?

Upvotes: 6

Views: 6656

Answers (2)

Anil Sah

Reputation: 1632

LDA Procedure

Step 1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand).

Step 2: This random assignment gives topic representations of all the documents and word distributions of all the topics, albeit not very good ones.

So, to improve upon them: For each document d, go through each word w and compute:

  • p(topic t | document d): proportion of words in document d that are assigned to topic t

  • p(word w | topic t): proportion of assignments to topic t, over all documents, that come from word w

Step 3: Reassign word w a new topic t’, where we choose topic t’ with probability

  • p(topic t’ | document d) * p(word w | topic t’)

This generative model predicts the probability that topic t’ generated word w. We iterate this last step multiple times for each document in the corpus to reach a steady state.

Worked calculation

Let's say you have two documents.

Doc i: “The bank called about the money.”

Doc ii: “The bank said the money was approved.”

After removing stop words, capitalization, and punctuation, the unique words in the corpus are: bank, called, about, money, said, approved.

[image: initial random topic assignment for every word, and the count tables built from those assignments]

Next, we randomly select a word from doc i (the word bank, with topic assignment 1), remove its current topic assignment, and calculate the probability of its new assignment.

[image: count tables after removing the assignment of the word bank]

For topic k = 1: [image: p(topic 1 | doc i) and p(bank | topic 1)]

For topic k = 2: [image: p(topic 2 | doc i) and p(bank | topic 2)]

Now we calculate the product of those two probabilities for each topic: [image: the two products, drawn as rectangle areas]

Topic 2 is a better fit for both the document and the word than topic 1 (its product, drawn as an area, is greater). So our new assignment for the word bank will be topic 2.

Now we update the counts to reflect the new assignment. [image: updated count tables]

We then repeat the same reassignment step, iterating through every word of the whole corpus. [image: assignments after further iterations]

Upvotes: 1

john mangual

Reputation: 8172

Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

Then he does some "calculations"

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B

And takes guesses at the topics:

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
    • at which point, you could interpret topic A to be about food
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
    • at which point, you could interpret topic B to be about cute animals

Your question is: how did he come up with those numbers? First, consider which words in these sentences carry "information":

  • broccoli, bananas, smoothie, breakfast, munching, eat
  • chinchilla, kitten, cute, adopted, hamster

Now let's go sentence by sentence, counting words from each topic:

  • food 3, cute 0 --> food
  • food 5, cute 0 --> food
  • food 0, cute 3 --> cute
  • food 0, cute 2 --> cute
  • food 2, cute 2 --> 50% food + 50% cute

So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" towards food.


We made two calculations in our heads:

  • to look at the sentences and come up with 2 topics in the first place. LDA does this by treating each sentence as a "mixture" of topics and estimating the parameters of each topic.
  • to decide which words are important. In practice this comes from preprocessing (stop-word removal) and from weighting schemes such as "term-frequency/inverse-document-frequency"; LDA then concentrates the informative words into topics.

Upvotes: 7
