Reputation: 29581
I have taken a look at, and tried out, scikit-learn's tutorial on its Multinomial Naive Bayes classifier.
I want to use it to classify text documents, and the catch with NB is that it treats P(document|label) as a product of all its independent features (words). Right now, I need to try out a trigram (3-gram) classifier, whereby P(document|label) = P(wordX|wordX-1, wordX-2, label) * P(wordX-1|wordX-2, wordX-3, label) * ...
Does scikit-learn support anything with which I can implement this language model and extend the NB classifier to perform classification based on it?
Upvotes: 2
Views: 2837
Reputation: 28856
CountVectorizer will extract trigrams for you (using ngram_range=(3, 3)). The text feature extraction documentation introduces this. Then, just use MultinomialNB exactly like before with the transformed feature matrix.
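For concreteness, here is a minimal sketch; the toy corpus, labels, and variable names are purely illustrative:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy data, just for illustration.
    docs = ["the cat sat on the mat", "the dog ate my homework"]
    labels = ["cats", "dogs"]

    # ngram_range=(3, 3) makes each feature a word trigram
    # instead of a single word.
    clf = make_pipeline(CountVectorizer(ngram_range=(3, 3)), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["the cat sat on my homework"]))  # -> ['cats']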
Note that this is actually modeling:
P(document | label) = P(wordX, wordX-1, wordX-2 | label) * P(wordX-1, wordX-2, wordX-3 | label) * ...
How different is that? Well, that first term can be written as
P(wordX, wordX-1, wordX-2 | label) = P(wordX | wordX-1, wordX-2, label) * P(wordX-1, wordX-2 | label)
Of course, all the other terms can be written that way too, so you end up with (dropping the subscripts and the conditioning on the label for brevity):
P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1, X-2) P(X-2, X-3) ... P(2, 1)
Now, P(X-1, X-2) can be written as P(X-1 | X-2) P(X-2). So if we do that for all those terms, we have
P(X | X-1, X-2) P(X-1 | X-2, X-3) ... P(3 | 2, 1) P(X-1 | X-2) P(X-2 | X-3) ... P(2 | 1) P(X-2) P(X-3) ... P(1)
So this is actually like using trigrams, bigrams, and unigrams (though not estimating the bigram/unigram terms directly).
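And if you would rather include those lower-order counts as explicit features instead of leaving them implicit, CountVectorizer can emit all three orders at once. A small sketch, reusing the same illustrative toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]

    # ngram_range=(1, 3) extracts unigrams, bigrams, and trigrams in one pass,
    # so all three orders appear as explicit count features.
    vec = CountVectorizer(ngram_range=(1, 3))
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())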
Upvotes: 5