Reputation:
I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone's talking about pork, steak, wine, soda, or eating, it's classified into f&b.
Whereas if somebody's talking about, say, gold, a pendant, a locket, etc., it's classified into jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
Upvotes: 7
Views: 3377
Reputation: 20729
This answer may be a bit long, and I abstract a few things away, but it's just to give you an idea and some advice.
As others already mentioned, in the land of machine learning there are 2 main roads: supervised and unsupervised learning. As you probably already know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values. For instance, if a text is related to clothes and shoes, the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multi-label classification.
I presume that the dataset is not yet labeled, since Twitter does not do this categorisation for you. So here comes a big decision on your part.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature. A value of true indicates that the word is present in the document; false indicates its absence.
Before doing this, you need to do some preprocessing. This can be done using various tools provided by the NLTK library (tokenisation, stopword removal, stemming, and so on).
Once you have preprocessed the data, generate the feature set from every word that exists across the documents. There are automatic methods and filters for this, but I'm not sure how to do that in Python.
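For illustration, here's a minimal sketch of that preprocessing and boolean feature extraction with NLTK. The example comments are invented, and in practice you would build the vocabulary from your whole corpus:

    # Minimal sketch: NLTK preprocessing + boolean word-presence features.
    # The example texts are invented; build the vocabulary from your corpus.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # nltk.download('punkt'); nltk.download('stopwords')  # one-time setup

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        """Lowercase, tokenize, drop stopwords and non-words, stem."""
        tokens = nltk.word_tokenize(text.lower())
        return [stemmer.stem(t) for t in tokens
                if t.isalpha() and t not in stop_words]

    documents = ["Loving my new gold pendant and locket!",
                 "Great steak and wine at dinner tonight"]

    # Vocabulary = every word seen in any document.
    vocabulary = sorted({w for doc in documents for w in preprocess(doc)})

    def document_features(text):
        """True/False for each vocabulary word, as described above."""
        words = set(preprocess(text))
        return {w: (w in words) for w in vocabulary}

    print(document_features("I bought a gold locket"))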
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You could use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVM.
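If you pick scikit-learn, a multi-label setup with Logistic Regression could look roughly like this sketch (the texts and labels are invented; LinearSVC would slot in the same way):

    # Sketch: multi-label classification with scikit-learn. Logistic
    # Regression is wrapped in a one-vs-rest scheme, i.e. one binary
    # classifier per category. Texts and labels are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = ["gold pendant and a silver locket",
             "steak, wine and soda for dinner",
             "new leather shoes and a jacket"]
    labels = [["jewelries"], ["food & beverages"], ["shoes", "clothes"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)  # boolean label matrix, one column per category

    model = make_pipeline(
        CountVectorizer(binary=True),          # word-presence features, as above
        OneVsRestClassifier(LogisticRegression()),
    )
    model.fit(texts, Y)

    predicted = model.predict(["bought a gold locket today"])
    print(mlb.inverse_transform(predicted))    # e.g. [('jewelries',)]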
Training & testing
You will use a part of your data for training and a part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to manually label the data, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset into one fixed training set and testing set. Instead it runs in multiple rounds, each time using one part of the data for training and another part for testing, so that all of the data is used for both training and testing across the rounds.
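A minimal sketch of what that looks like with scikit-learn; the tiny invented dataset is only there to show the mechanics, in reality you need far more labelled data:

    # Sketch: k-fold cross-validation. Each round trains on one part of the
    # data and tests on the rest, so every document gets tested once.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    texts = ["gold locket", "silver pendant", "steak and wine",
             "soda and snacks", "running shoes", "leather boots"]
    labels = ["jewelries", "jewelries", "f&b", "f&b", "shoes", "shoes"]

    model = make_pipeline(CountVectorizer(), LogisticRegression())
    scores = cross_val_score(model, texts, labels, cv=2)
    print(scores)  # one accuracy score per fold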
Predicting
Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, a data mining tool written in Java that offers a great UI and speeds up your task a lot!
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part for the end of my answer). However, if your goal is to categorise the tweets/FB messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method, where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). Now the cool thing is that the algorithm will find 40 topics containing words that are related to each other. It will also output, for each document, a distribution indicating which topics the document is related to. This way you can discover many more categories than your 5 predefined ones.
Now I'm not going to go much deeper into this, but just Google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
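If you'd rather experiment in Python than in MALLET, a minimal sketch with gensim's LDA could look like this (the toy corpus and topic count are invented; real corpora need far more documents to get coherent topics):

    # Sketch: topic modelling with gensim's LDA on an invented toy corpus.
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["gold", "pendant", "locket", "silver"],
            ["steak", "wine", "soda", "dinner"],
            ["shoes", "leather", "boots", "laces"]]  # already tokenized

    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(d) for d in docs]

    lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10)

    for topic_id, words in lda.print_topics():
        print(topic_id, words)                 # top words per discovered topic

    # Topic distribution for a new, unseen document:
    print(lda[dictionary.doc2bow(["gold", "locket"])])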
Upvotes: 7
Reputation: 122052
What you're looking for sits at the intersection of Natural Language Processing (NLP), which is about processing text data, and machine learning, which is where the classification models are built. First I would suggest going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in Python or Java, take a look at Java or Python for Natural Language Processing.
If you're new to text processing, please take a look at the NLTK library, which provides a nice introduction to doing NLP; see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have Twitter/Facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling, see http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training (70-90%) and a testing (10-30%) set; see the note below.
Note: I say "at least" because there are other ways to split up your documents, e.g. for development or cross-validation. (If you don't understand this, it's all right, just follow step 2.)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply the model to the test set and then see how well it performs.
3b. If unsupervised, use the training set to generate document clusters (that means grouping similar documents together), but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To this date, there is no really good solution to this; even super-effective neural networks cannot tell you what their neurons are firing for, only that each neuron fires on something specific.) A minimal clustering sketch follows below.
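To make 3b concrete, here is a minimal clustering sketch with scikit-learn's k-means over TF-IDF features. The documents are invented, and the cluster ids come out unnamed, which is exactly the labelling problem described above:

    # Sketch: the unsupervised path (1b/3b). k-means groups similar
    # documents, but the resulting cluster ids carry no category names.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["gold pendant and locket", "steak and red wine",
            "new running shoes", "silver necklace", "soda with dinner"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_)  # cluster id per document -- still nameless groups

    # Inspect the strongest terms per centroid to help label clusters by hand.
    terms = vectorizer.get_feature_names_out()
    for i, centroid in enumerate(km.cluster_centers_):
        top = centroid.argsort()[-3:][::-1]
        print(i, [terms[j] for j in top])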
Upvotes: 2
Reputation: 3060
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NLTK also has tools for classification which are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example which uses NLTK and the WordNet database. You have access to the Synsets, whose coverage seems to be pretty broad; and you can also have a look at the hypernyms (see for example list(dog.closure(hyper)) ).
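For example, a minimal sketch of those WordNet lookups with NLTK (it assumes the wordnet corpus has already been downloaded):

    # Sketch: walk up the hypernym closure of a word's first synset; broad
    # concepts like "food" can show up along the way.
    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    steak = wn.synsets('steak')[0]   # Synset('steak.n.01')
    hyper = lambda s: s.hypernyms()

    for ancestor in steak.closure(hyper):
        print(ancestor.name())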
Basically you should consider using a multiclassifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multiclassifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can describe the same object; you could have electronic jewelry, i.e. smartwatches, etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will let you iterate easily (if needed).
Good luck!
Upvotes: 2