BitByter GS

Reputation: 999

Stemming in Text Classification - Degrades Accuracy?

I am implementing a text classification system using Mahout. I have read that stop-word removal and stemming help improve the accuracy of text classification. In my case, removing stop-words gives better accuracy, but stemming is not helping: I see a 3-5% decrease in accuracy after applying a stemmer. I tried the Porter stemmer and K-stem, but got almost the same result in both cases.

I am using Naive Bayes algorithm for classification.

Any help is greatly appreciated.

Upvotes: 3

Views: 5402

Answers (1)

ffriend

Reputation: 28492

First of all, you need to understand why stemming normally improves accuracy. Imagine the following sentence in a training set:

He played below-average football in 2013, but was viewed as an ascending player before that and can play guard or center.

and the following in a test set:

We’re looking at a number of players, including Mark

The first sentence contains a number of words referring to sports, including the word "player". The second sentence, from the test set, also mentions a player, but in the plural: "players", not "player", so to the classifier it is a distinct, unrelated variable.

Stemming tries to cut off details like the exact form of a word and produce word bases as features for classification. In the example above, stemming would shorten both words to "player" (or even "play") and use them as the same feature, giving the classifier a better chance of assigning the second sentence to the "sports" class.
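
Here is a minimal sketch of that collapse, using Python and NLTK's Porter stemmer rather than Mahout, purely for illustration:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # The training-set and test-set forms map to the same stem,
    # so the classifier treats them as one feature.
    print(stemmer.stem("player"))   # 'player'
    print(stemmer.stem("players"))  # 'player'
    print(stemmer.stem("played"))   # 'play'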

Sometimes, however, these details play an important role by themselves. For example, the phrase "runs today" may refer to a runner, while "long running" may be about a phone's battery lifetime. In such cases stemming makes classification worse, not better.
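
The same kind of sketch (again NLTK's Porter stemmer, assumed only for illustration) shows the information loss:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # The noun and the participle collapse to the same stem, so the
    # runner/battery distinction is gone before the classifier sees it.
    print(stemmer.stem("runs"))     # 'run'
    print(stemmer.stem("running"))  # 'run'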

What you can do here is use additional features that help distinguish between different meanings of the same words/stems. Two popular approaches are n-grams (e.g. bigrams: features made of word pairs instead of individual words) and part-of-speech (POS) tags. You can try any combination of them, e.g. stems + bigrams of stems, words + bigrams of words, stems + POS tags, or stems, bigrams and POS tags together.
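
One hedged sketch of "stems + bigrams of stems" using scikit-learn's CountVectorizer (not Mahout; the tokenizer and sample texts are made up for illustration):

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()

    def stem_tokenizer(text):
        # Crude whitespace tokenization followed by stemming.
        return [stemmer.stem(t) for t in text.lower().split()]

    # ngram_range=(1, 2) emits unigram stems plus bigrams of stems,
    # so a pair like "long run" survives as a single feature.
    vectorizer = CountVectorizer(tokenizer=stem_tokenizer, ngram_range=(1, 2))
    X = vectorizer.fit_transform(["long running phone battery",
                                  "he runs today in the race"])
    print(vectorizer.get_feature_names_out())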

Also, try out other algorithms. E.g. SVM uses a very different approach than Naive Bayes, so it can catch things in the data that NB misses.
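
As a rough illustration (scikit-learn again, with a toy corpus invented for the example; Mahout's API would differ), the same bag-of-words features can be fed to both classifiers and compared:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy corpus, invented for illustration only.
    train_texts = ["he played football as a guard",
                   "long running battery in this phone"]
    train_labels = ["sports", "tech"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_texts)

    # Same features, two very different decision rules.
    for clf in (MultinomialNB(), LinearSVC()):
        clf.fit(X, train_labels)
        print(type(clf).__name__,
              clf.predict(vectorizer.transform(["football players on the field"])))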

Upvotes: 6
