How to improve performance of SMO classifier in weka?

Question

I am using weka SMO classifier for classify the documents.There are many parameters for smo available like Kernal, tolerance etc.., I tested using different parameters but i not get good result large data set.

For more than 90 category only 20% documents getting correctly classified.

Please anyone tell me the best set of parameter to get highest performance in SMO.

ffriend · Accepted Answer

Principal issue here is not classification itself, but rather selecting suitable features. Using raw HTML leads to very large noise which in its turn makes classification results very poor. Thus, to get good results do the following:

Extract relevant text. Not just remove HTML tags, but get exactly the text describing item.
Create dictionary of key words. E.g. capuccino, latte, white rice, etc.
Use stemming or lemmatization to get word's base form and avoid counting, for example, "cotton" and "cottons" as 2 different words.
Make feature vectors from text. Attributes (feature names) should be all words from your dictionary. Values may be: binary (1 if word occurs in text, 0 otherwise), integer (number of occurrences of word in question in text), tf-idf (use this one if your texts have very different lengths) and others.
And only after all these steps you can use classifer.

Most probably classifier type won't play a big role here: dictionary-based features normally lead to quite exact results regardless of classification technique in use. You can use SVM (SMO), Naive Bayes, ANN or even kNN. More sophisticated methods include creation of category hierarchy, where, for example, category "coffee" is included into category "drinks" which in its turn is part of category "food".

How to improve performance of SMO classifier in weka?

Answers (1)

Related Questions