Reputation: 2062
I want to write a script that parses a user's tweets and classifies each one into one of a set of predefined categories. For example:
"Ed Miliband will lose election if he is 'seduced' by Blairites, says union chief http://bit.ly/145CRAD"
should be classified under the domain Politics.
"Dear Sachin, you're 40. Buy a sports car, have flings with 20 yr old blondes. Enjoy your midlife crisis. Leave IPL for the boys - your fan"
should be classified under the domain Cricket.
What is the best way to do this?
Upvotes: 1
Views: 3064
Reputation: 279
How about LDA (Latent Dirichlet Allocation), a topic model?
You can try online LDA in Python:
http://www.cs.princeton.edu/~blei/topicmodeling.html
If you then want something faster, you can try a distributed LDA implementation such as LightLDA.
Upvotes: 0
Reputation: 18349
This is a complex problem in the field of Natural Language Processing (NLP) called document classification. One of the best open source libraries out there is maintained by The Stanford NLP Group. Good luck!
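To make the idea of document classification concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with only the Python standard library. It is not the Stanford library's API, just an illustration of the technique. The training tweets, the `tokenize` helper, and the `NaiveBayesClassifier` class are all made up for this example; a real system would need far more labelled data per category.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, drop URLs, and split into word tokens (keeps #hashtags and @mentions)."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[#@]?[a-z0-9']+", text)


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> number of training tweets
        self.word_counts = defaultdict(Counter)   # label -> token -> count
        self.vocabulary = set()

    def train(self, labelled_tweets):
        for text, label in labelled_tweets:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log P(label) + sum of log P(token | label), smoothed
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-written training set (an assumption for illustration only).
training = [
    ("Union chief warns party leader ahead of the election", "Politics"),
    ("Parliament votes on the new budget as the election nears", "Politics"),
    ("Minister resigns after party leadership vote", "Politics"),
    ("Sachin scores a century in the IPL final", "Cricket"),
    ("Bowler takes five wickets as the IPL season opens", "Cricket"),
    ("Batsman hits six off the last ball to win the match", "Cricket"),
]

clf = NaiveBayesClassifier()
clf.train(training)

politics_tweet = ("Ed Miliband will lose election if he is 'seduced' by "
                  "Blairites, says union chief http://bit.ly/145CRAD")
cricket_tweet = ("Dear Sachin, you're 40. Buy a sports car, have flings with "
                 "20 yr old blondes. Enjoy your midlife crisis. "
                 "Leave IPL for the boys - your fan")

print(clf.classify(politics_tweet))
print(clf.classify(cricket_tweet))
```

With so few training examples the decision rests on a handful of overlapping words ("election", "union", "IPL", "Sachin"), so treat this as a starting point for experimenting rather than something that would hold up on real tweets.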
Upvotes: 1
Reputation: 39277
You are looking for a 'Topic Model'. Techniques include Latent Dirichlet Allocation and others. The Wikipedia article includes links to resources such as MALLET, which should help you.
You didn't specify which language you want to use, or what 'best' means: easiest to implement, fastest, or best results?
Another alternative is to use humans (e.g. Amazon Mechanical Turk), which may give you the 'best' results for tweets, since they are notoriously hard to classify given all the abbreviations, sarcasm, and hashtags ... #notAnEasyProblem.
Upvotes: 4
Reputation: 930
These papers would be a good place to start: http://dl.acm.org/citation.cfm?id=1835643 http://www.tmrfindia.org/ijcsa/v9i15.pdf
Upvotes: 1