Jérôme R

Reputation: 1257

Classifying documents into multiple categories

I wrote a classification program based on the Naive Bayes algorithm, which works pretty well for classifying one document into exactly one category. However, my requirements have changed and I now need to classify a document into N categories.

Basically, I need to switch from a 'spam'/'non-spam' classification to something like 'spam and poker', 'spam and something', 'non-spam'.

I thought about 2 options:

  1. Tweaking the algorithm to return the possible categories sorted by their probability. This could work, but it does not seem right to me. What do you think?

  2. Using a completely different algorithm. In that case, which one would you recommend?

Thanks in advance for your feedback :)

Upvotes: 3

Views: 2457

Answers (3)

Fred Foo

Reputation: 363547

Since your classes are not disjoint, this is multi-label classification. There's support for that in the scikit-learn package using the simple one-vs.-rest rule (aka binary relevance): for each of the decisions spam/non-spam, poker/non-poker, etc. a separate classifier is trained and at prediction time, each is run independently on the test samples.
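A minimal sketch of the one-vs.-rest setup described above, assuming scikit-learn is installed. The documents and labels are made-up toy data, and `MultinomialNB` is used as the per-label classifier only because the question is about Naive Bayes; any binary classifier could be wrapped instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, made-up training data: each document carries zero or more labels.
docs = [
    "win big at poker tonight, cheap chips",
    "poker tournament invite, buy chips now",
    "cheap pills, buy now, limited offer",
    "limited offer, cheap watches, buy now",
    "meeting agenda for the project review",
    "notes from the project meeting",
    "friendly poker night with colleagues after the meeting",
]
labels = [
    {"spam", "poker"},
    {"spam", "poker"},
    {"spam"},
    {"spam"},
    set(),
    set(),
    {"poker"},
]

# Turn the label sets into a binary indicator matrix, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# One independent Naive Bayes classifier is trained per label column.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(docs, Y)

# Each per-label classifier votes independently on the new document.
predicted = model.predict(["buy poker chips now, cheap offer"])
predicted_labels = binarizer.inverse_transform(predicted)
print(predicted_labels)
```

Because every label gets its own yes/no decision, a document can end up with several labels, one label, or none at all.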

Smarter approaches include classifier chains.

(Disclaimer: I wrote parts of the multi-label classification code in sklearn, so this is not unbiased advice.)

Upvotes: 1

Ben Allison

Reputation: 7394

There's no reason not to extend Naive Bayes to multiple categories; in fact, it's a simple classifier that extends naturally to the multi-category case. If your categories "spam and poker", "spam and something", "not spam" are entirely disjoint, you can treat it as a single three-way classification task: if you have categories c_1, c_2 and c_3 with prior probabilities p_1, p_2 and p_3, and likelihoods (probabilities of the instance given each class) l_1, l_2 and l_3, then the posterior probability of class c_i is proportional to its prior times its likelihood, p_i*l_i (the normaliser is just the sum p_1*l_1 + p_2*l_2 + p_3*l_3). This is equally true for any number of classes.
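To make the arithmetic concrete, here is a small sketch of that posterior computation in plain Python. The priors and likelihoods are made-up numbers for illustration only, and the function name `posteriors` is just a label chosen for this example.

```python
def posteriors(priors, likelihoods):
    """Return P(class | document) for each class via Bayes' rule."""
    # Unnormalised score per class: prior times likelihood.
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # The normaliser is the sum of those scores over all classes.
    normaliser = sum(joint.values())
    return {c: score / normaliser for c, score in joint.items()}


# Made-up values for p_1..p_3 and l_1..l_3.
priors = {"spam and poker": 0.2, "spam and something": 0.3, "not spam": 0.5}
likelihoods = {"spam and poker": 0.04, "spam and something": 0.01, "not spam": 0.002}

result = posteriors(priors, likelihoods)
best = max(result, key=result.get)
print(result, best)
```

With real documents the likelihoods are products of many small word probabilities, so implementations usually sum log-probabilities instead of multiplying raw values to avoid underflow.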

However, I suspect you may get better performance in practice by first deciding whether it's spam or not, and then determining the type of spam (a two-stage classification process).
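The control flow of such a two-stage process could look roughly like this. The two `toy_*` functions are hypothetical keyword-based stand-ins so the sketch runs on its own; in a real system each stage would be a trained classifier (for example, a Naive Bayes model per stage).

```python
from typing import Callable


def classify_two_stage(
    document: str,
    is_spam: Callable[[str], bool],
    spam_type: Callable[[str], str],
) -> str:
    """Stage 1 decides spam vs. not spam; stage 2 runs only on spam."""
    if not is_spam(document):
        return "not spam"
    return "spam and " + spam_type(document)


# Hypothetical stand-in for a trained spam/non-spam classifier.
def toy_is_spam(document: str) -> bool:
    words = set(document.lower().split())
    return bool(words & {"buy", "cheap", "win"})


# Hypothetical stand-in for a classifier trained on spam documents only.
def toy_spam_type(document: str) -> str:
    words = set(document.lower().split())
    return "poker" if "poker" in words else "something"


for text in ["Buy cheap poker chips", "Cheap watches", "Project meeting notes"]:
    print(text, "->", classify_two_stage(text, toy_is_spam, toy_spam_type))
```

One design point worth noting: the second-stage model only ever sees spam, so it can be trained on spam documents alone and focus on what distinguishes the spam types from each other.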

Tom Mitchell's book "Machine Learning" is a pretty accessible introduction to this stuff if you happen to have access to it.

Upvotes: 4

Exceptiondev

Reputation: 121

We use the J48 algorithm (http://de.wikipedia.org/wiki/J48) from the Weka library (http://www.cs.waikato.ac.nz/~ml/weka/) and it works great!

LingPipe (http://alias-i.com/lingpipe/) also works great.

Both are very easy to set up and work out of the box.

Upvotes: 1
