Reputation: 11
Thanks for taking the time to look at this question. I recently scraped some text from the web and saved the output as a single .txt file of roughly 300 pages. I am trying to implement LDA to build topics and am familiar with the technical aspects of doing it.
However, my question is whether it matters for LDA whether I use one file or multiple files (i.e., if I am examining 200 emails, do I need to tokenize, remove stopwords and punctuation, and stem the big file and then save each email as a separate .txt file before running LDA, or can I do it all in the one file?).
The problem I am facing right now is that the pre-processing of the document would take ages if I were to break everything up into separate .txt files. Any suggestions? Many thanks.
Upvotes: 1
Views: 1396
Reputation: 3283
Well, it matters, because the idea behind LDA is to estimate a document-topic and a topic-word distribution. Using a single document goes against the whole concept of finding the topic-word distribution, which in essence tells us the probability of word w being generated by topic t.
If we only have one document, then there is no distinction between topics, because every word will occur in that same document.
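To put that in symbols, the standard LDA model factors the probability of word w appearing in document d as (a textbook formulation, shown here only for illustration):

    p(w | d) = \sum_{t=1}^{K} p(w | t) \, p(t | d)

With only one document there is a single p(t | d) to estimate, so the topics cannot be separated by how they are mixed across different documents, which is the point made above.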
Upvotes: 1
Reputation: 42293
This is a coding site, and since you don't have any code in your question, you're not really asking a question suited to this site. That might be why you haven't got any answers so far.
That said, you can read your single text file into R, pre-process each document within that file, and generate topic models from it. I've tried it both ways, with one giant file of many docs and with many small files of one doc each, and found the difference in processing speed to be very small.
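As a rough illustration of that workflow, here is a minimal sketch using the tm and topicmodels packages. The file name emails.txt, the blank-line separator between emails, and k = 10 topics are assumptions for the example, not details from the question:

    library(tm)            # corpus handling and pre-processing
    library(topicmodels)   # LDA implementation

    # Read the whole file and split it into one document per email.
    # Assumption: emails are separated by a blank line.
    raw  <- readChar("emails.txt", file.info("emails.txt")$size)
    docs <- strsplit(raw, "\n\n")[[1]]

    # Build a corpus and pre-process: lowercase, strip punctuation,
    # remove stopwords, and stem (stemming uses the SnowballC package).
    corpus <- VCorpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)

    # Document-term matrix; drop any documents left empty by pre-processing.
    dtm <- DocumentTermMatrix(corpus)
    dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

    # Fit LDA with an assumed k = 10 topics and inspect the top terms.
    lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))
    terms(lda_fit, 5)

The key point is that the split into documents happens in memory (here on blank lines), so there is no need to write each email out as its own .txt file first.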
Upvotes: 2