nagendra

Reputation: 1965

How to form documents for LDA on twitter data

We need to do topic modelling on tweets from the live Twitter stream. The input goes to Spark Streaming, which stores the data in HDFS. A batch job then runs on the collected data to find the underlying topics in the tweets. For this we are using the Latent Dirichlet Allocation (LDA) algorithm. Each tweet is at most 140 characters long and is stored as one row in HDFS.

I'm new to the LDA algorithm and have only a basic understanding of it: the topic model is derived from word co-occurrences across n documents.

As I understand it, there are two options for feeding the data to LDA.

Option 1: Use each tweet (one row) as a single document for LDA.

Option 2: Group the rows into documents and pass these documents to LDA.

I want to understand how the distribution of vocabulary (words) over topics is affected by each option. Which option should be preferred for better topic modelling?

Also, please let me know if there is a better way to do topic modelling on Twitter data than these two options.

Note: When I ran both options and displayed the results as word clouds, I could see that the distribution of words over the topics (3) differs between the two.

Any help appreciated.

Thanks in advance.

Upvotes: 1

Views: 1190

Answers (1)

ML_TN

Reputation: 727

Using LDA with short documents is a bit tricky, since LDA assigns a topic to each word and models each document as a mixture of several topics. A tweet usually covers only one topic and contains only a handful of words, so each document offers very few word co-occurrences to learn from, and the resulting topic distributions are usually garbage. (This is your option 1.)

I know that there's a paper and a Java tool for topic modelling on short text, but I have never used it. Here's the link to the GitHub repo

For option 2, I think it is possible to use LDA and get coherent topics, but you need to find some semantic structure to group by, e.g. source, date, keyword or hashtag.
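As a rough illustration of grouping by hashtag, here is a minimal Python sketch. It is not taken from any library: the function name `pool_tweets_by_hashtag`, the `_untagged_` key prefix and the sample tweets are all made up for the example, and it assumes the tweets are available as plain strings. Each hashtag becomes one pseudo-document, and tweets without a hashtag stay as single-tweet documents.

```python
import re
from collections import defaultdict

# Matches "#word" and captures the word without the leading '#'.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_tweets_by_hashtag(tweets):
    """Group tweets into pseudo-documents keyed by lowercase hashtag.

    A tweet with several hashtags is added to every matching pool.
    A tweet without hashtags is kept as its own document under a
    synthetic "_untagged_<index>" key (a hypothetical convention).
    """
    pools = defaultdict(list)
    for index, text in enumerate(tweets):
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if not tags:
            pools[f"_untagged_{index}"].append(text)
            continue
        for tag in tags:
            pools[tag].append(text)
    # Join each pool into one longer text that can be fed to LDA.
    return {key: " ".join(texts) for key, texts in pools.items()}


# Made-up sample data, only for illustration.
tweets = [
    "Loving the new #Spark release",
    "#spark streaming to HDFS works well",
    "Great match tonight #football",
    "no hashtag here",
]
documents = pool_tweets_by_hashtag(tweets)
```

The values of `documents` would then be tokenised and passed to LDA in place of the individual tweets. Grouping by date or source would follow the same pattern with a different key.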

I would be really interested in the results you get if you try either of the proposed options any time soon.

Upvotes: 3
