Reputation: 12131
So, I'm trying to do multiclass text classification. I have been reading a lot of old questions and blog posts, but I still can't fully understand the concept.
I also tried some examples from this blog post: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
But when it comes to multiclass classification I don't quite understand it. Let's say I want to classify text into multiple languages: French, English, Italian and German. And I want to use Naive Bayes, which I think would be the easiest to start with. From what I have read in the old questions, the simplest solution would be one-vs-all, where each language gets its own model. So I would have four models, one each for French, English, Italian and German. Then I would run a text against every model and check which one returns the highest probability. Am I correct?
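To make sure I'm describing it right, here is a rough sketch of what I mean by one-vs-all; the toy data and the word-presence features are made up just for illustration:

import nltk

# Toy one-vs-all sketch: one binary Naive Bayes model per language,
# each trained on "this language" vs "other". Data is made up.
texts = [('bonjour tout le monde', 'French'),
         ('hello everyone out there', 'English'),
         ('buongiorno a tutti quanti', 'Italian'),
         ('guten morgen alle zusammen', 'German')]

def features(text):
    return {word: True for word in text.lower().split()}

languages = {lang for _, lang in texts}
models = {}
for lang in languages:
    # Relabel the data as lang vs 'other' and train one binary model.
    train = [(features(t), lang if l == lang else 'other') for t, l in texts]
    models[lang] = nltk.NaiveBayesClassifier.train(train)

# Run the text against every model and keep the highest probability.
sample = features('bonjour mes amis')
best = max(languages,
           key=lambda lang: models[lang].prob_classify(sample).prob(lang))
print(best)  # likely 'French'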
But when it comes to coding, the blog post's example has tweets like these, which will be classified as either positive or negative:
pos_tweets = [('I love this car', 'positive'),
('This view is amazing', 'positive'),
('I feel great this morning', 'positive'),
('I am so excited about tonight\'s concert', 'positive'),
('He is my best friend', 'positive')]
neg_tweets = [('I do not like this car', 'negative'),
('This view is horrible', 'negative'),
('I feel tired this morning', 'negative'),
('I am not looking forward to tonight\'s concert', 'negative'),
('He is my enemy', 'negative')]
So each tweet is tagged as either positive or negative. When it comes to training one model for French, how should I tag the text? Would it be like this, with French as the "positive" class?
[('Bonjour', 'French'),
 ('Je m\'appelle', 'French')]
And the negative would be
[('Hello', 'English'),
('My name', 'English')]
But would this mean I could just add Italian and German and have a single model for all 4 languages? Or do I not really need the negative examples?
So, the question is: what's the right approach to multiclass classification with NLTK?
Upvotes: 6
Views: 5491
Reputation: 1
Classifiers in NLTK (http://www.nltk.org/api/nltk.classify.html) come in several variants, and it is important to understand the subtle differences.
The simplest variant distinguishes between exactly two categories, e.g. positive versus negative sentiment, or male versus female (http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.positivenaivebayes).
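For illustration, a minimal sketch of the two-category case using the tweet data from the question; the word-presence feature extractor is just one possible choice, not the only one:

import nltk

# Toy binary training set from the question; word presence as features.
def features(text):
    return {word: True for word in text.lower().split()}

train = [(features('I love this car'), 'positive'),
         (features('This view is amazing'), 'positive'),
         (features('I do not like this car'), 'negative'),
         (features('This view is horrible'), 'negative')]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features('I love this view')))  # likely 'positive'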
The second variant is when you have several categories (two or more), e.g. text in French, German or English, and you assume that every text uses exactly one language. Note that NLTK's terminology does not describe this as "multiclass", which can be understandably misleading when you are new to it. Just think of it this way: the classifier will not assign one text to multiple classes, e.g. both German and French, but only to a single class.
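A sketch of this single-label, several-category case; the toy data and the character-bigram features are assumptions, chosen because character n-grams tend to work for language identification:

import nltk

# Hypothetical toy data; character bigrams as features.
def lang_features(text):
    text = text.lower()
    return {text[i:i+2]: True for i in range(len(text) - 1)}

train = [(lang_features('bonjour tout le monde'), 'French'),
         (lang_features('hello everyone out there'), 'English'),
         (lang_features('guten morgen alle zusammen'), 'German')]

classifier = nltk.NaiveBayesClassifier.train(train)
# Exactly one label comes back, never a mix of languages.
print(classifier.classify(lang_features('bonjour mes amis')))  # likely 'French'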
Finally, there is the multi-label classifier interface (MultiClassifierI in nltk.classify.api), which is different in that a given input can be assigned to more than one class, e.g. 50% French and 50% German, or 40% English, 30% German and 30% French.
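If what you actually want is that graded, percentage-style output, note that prob_classify on an ordinary NLTK classifier already returns a probability distribution over all labels. A small sketch, repeating the toy setup from above:

import nltk

def lang_features(text):
    text = text.lower()
    return {text[i:i+2]: True for i in range(len(text) - 1)}

train = [(lang_features('bonjour tout le monde'), 'French'),
         (lang_features('hello everyone out there'), 'English'),
         (lang_features('guten morgen alle zusammen'), 'German')]
classifier = nltk.NaiveBayesClassifier.train(train)

# prob_classify returns a distribution over every label,
# e.g. something like French 0.7, German 0.2, English 0.1.
dist = classifier.prob_classify(lang_features('bonjour guten morgen'))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))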
Upvotes: 0
Reputation: 363487
There's no need for a one-vs-all scheme with Naive Bayes -- it's a multiclass model out of the box. Just feed a list of (sample, label) pairs to the classifier learner, where label denotes the language.
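A minimal sketch of that, assuming a simple word-presence feature extractor (any featureset dict will do, and character n-grams usually work better for language identification):

import nltk

# One model, one label per language -- no one-vs-all needed.
def features(text):
    return {word: True for word in text.lower().split()}

train = [(features("bonjour je m'appelle Pierre"), 'French'),
         (features('hello my name is Peter'), 'English'),
         (features('ciao mi chiamo Pietro'), 'Italian'),
         (features('hallo ich heisse Peter'), 'German')]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features('bonjour mes amis')))  # likely 'French'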
Upvotes: 9