samsamara

Reputation: 4750

Input documents to LDA

Assume I have N text documents and I run LDA in the following 2 ways:

1. run LDA over all N documents at once
2. run LDA on each document separately

I'm also unsure what number of topics to choose. In the first case I can select N as the number of topics (assuming each document is about a single topic), but if I run it on each document separately, I'm not sure how to select the number of topics.

What's going on in these two cases?

Upvotes: 1

Views: 1594

Answers (2)

Yas

Reputation: 21

LDA is a statistical model that assigns topics to documents. It works by distributing the words of each document over topics (randomly the first time), then repeating this step for a number of iterations (500, say) until the word-to-topic assignments are almost stable. At that point it can assign topics to a document according to the most frequent words in the document that have a high probability in each topic.

So it does not make sense to run it over one document: because you are using only one document, the words assigned to each topic in the first iteration will not change over the iterations, and the topics assigned to the document will be meaningless.
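To make the iterative procedure above concrete, here is a minimal sketch of one common way to fit LDA, collapsed Gibbs sampling, in plain Python. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration; real libraries may use other inference methods (variational Bayes, for instance).

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that the sampler keeps in sync with the assignments.
    doc_topic = [[0] * num_topics for _ in docs]                        # words in doc d on topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]   # times word w is on topic k
    topic_total = [0] * num_topics                                      # words on topic k overall

    # Step 1: assign every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight = (how much doc d likes topic t) * (how much topic t likes word w).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distributions.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Toy corpus (made up): the topic-word counts are shared across all documents.
docs = [["cat", "dog", "cat"], ["stock", "bond", "stock"], ["cat", "stock"]]
theta, topic_word = lda_gibbs(docs, num_topics=2, iterations=50)
```

Note that the `topic_word` counts are pooled over every document. With a single document there are no other documents to share those counts with, which is one way to see why a per-document run has so little to learn from.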

Upvotes: 0

nick_w

Reputation: 14948

Latent Dirichlet Allocation is intended to model the topic and word distributions for each document in a corpus of documents.

Running LDA over all of the documents in the corpus at once is the normal approach; running it on a per-document basis is not something I've heard of, and I wouldn't recommend it. It's difficult to say what would happen, but I wouldn't expect the results to be nearly as useful, because you couldn't meaningfully compare one document/topic or topic/word distribution with another.

I'm thinking that your choice of N for the number of topics might be too high (what if you had thousands of documents in your corpus?), but it really depends on the nature of the corpus you are modelling. Remember that LDA assumes a document will be a distribution over topics, so it might be worth rethinking the assumption that each document is about one topic.
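As a rough illustration of the whole-corpus approach, here is a small sketch using scikit-learn's `LatentDirichletAllocation`. The corpus and the choice of 2 topics are made up for the example; in practice the number of topics depends on your data and is usually much smaller than the number of documents.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 5 documents, but only 2 topics are requested.
corpus = [
    "the cat chased the dog around the garden",
    "dogs and cats are popular pets",
    "the stock market fell as bond yields rose",
    "investors sold stocks and bought bonds",
    "my pet cat ignores the stock market news",
]

# Bag-of-words counts over the whole corpus (one shared vocabulary).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

# One model fitted on all documents at once, so the topics are shared.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row per document, one column per topic

# Each row is that document's distribution over the same 2 topics.
for text, dist in zip(corpus, doc_topic):
    print(dist.round(2).tolist(), text)
```

Because every row of `doc_topic` refers to the same set of topics, the rows can be compared directly across documents, which is exactly what you lose if you fit a separate model per document.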

Upvotes: 4
