Reputation: 23
I'm trying to perform document classification into two categories (category1 and category2), using Weka.
I've gathered a training set consisting of 600 documents belonging to both categories and the total number of documents that are going to be classified is 1,000,000.
So to perform the classification, I apply the StringToWordVector filter. I set true the followings from the filter: - IDF transform - TF ransform - OutputWordCounts
I'd like to ask a few questions about this process.
1) How many documents shall I use as training set, so that I over-fitting is avoided?
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
3) As classification method I usually choose naiveBayes but the results I get are the followings:
-------------------------
Correctly Classified Instances 393 70.0535 %
Incorrectly Classified Instances 168 29.9465 %
Kappa statistic 0.415
Mean absolute error 0.2943
Root mean squared error 0.5117
Relative absolute error 60.9082 %
Root relative squared error 104.1148 %
----------------------------
and if I use SMO the results are:
------------------------------
Correctly Classified Instances 418 74.5098 %
Incorrectly Classified Instances 143 25.4902 %
Kappa statistic 0.4742
Mean absolute error 0.2549
Root mean squared error 0.5049
Relative absolute error 52.7508 %
Root relative squared error 102.7203 %
Total Number of Instances 561
------------------------------
So in document classification which one is "better" classifier? Which one is better for small data sets, like the one I have? I've read that naiveBayes performs better with big data sets but if I increase my data set, will it cause the "over-fitting" effect? Also, about Kappa statistic, is there any accepted threshold or it doesn't matter in this case because there are only two categories?
Sorry for the long post, but I've been trying for a week to improve the classification results with no success, although I tried to get documents that fit better in each category.
Upvotes: 1
Views: 353
Reputation: 146
Regarding the second question 2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
I was building a classifier and training it with the famous 20news group dataset, when testing it without the preprocessing the results were not good. So, i pre-processed the data according to the following steps:
These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/
Upvotes: 0
Reputation: 21
1) How many documents shall I use as training set, so that I over-fitting is avoided? \
You don't need to choose the size of training set, in WEKA, you just use the 10-fold cross-validation. Back to the question, machine learning algorithms influence much more than data set in over-fitting problem.
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role? \
Definitely it does. But whether the result get better can not be promised.
3) As classification method I usually choose naiveBayes but the results I get are the followings: \
Usually, to define whether a classify algorithm is good or not, the ROC/AUC/F-measure value is always considered as the most important indicator. You can learn them in any machine learning book.
Upvotes: 2
Reputation: 6271
To answers your questions:
the
, he
or and
) is a common strategy to improve your classifier. Weka's StringToWordVector
allows you to select a file containing these stop words, but it should also have a default list with English stop words.Correctly Classified Instances
). You might also want to take a look at (Lib)SVM or LibLinear (You may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing for easy installation), which can perform quite well on document classification as well.Upvotes: 1