user1669675

Reputation: 23

How to perform classification

I'm trying to perform document classification into two categories (category1 and category2), using Weka.

I've gathered a training set consisting of 600 documents belonging to both categories and the total number of documents that are going to be classified is 1,000,000.

So to perform the classification, I apply the StringToWordVector filter. I set the following filter options to true:

  - IDFTransform
  - TFTransform
  - OutputWordCounts
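If you are driving Weka from Java rather than the Explorer GUI, that filter setup can be sketched roughly as follows. This is a sketch, not a drop-in solution: the file name train.arff and the class-attribute position are placeholder assumptions, while StringToWordVector, its three setters, and Filter.useFilter are real Weka API calls.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BuildVectors {
    public static void main(String[] args) throws Exception {
        // Placeholder path: your 600 labelled training documents in ARFF form
        Instances data = DataSource.read("train.arff");
        // Assumes the class attribute is last; adjust if yours differs
        data.setClassIndex(data.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setIDFTransform(true);     // IDF transform
        filter.setTFTransform(true);      // TF transform
        filter.setOutputWordCounts(true); // counts instead of binary presence
        filter.setInputFormat(data);      // must be called after setting options

        Instances vectors = Filter.useFilter(data, filter);
        System.out.println(vectors.numAttributes() + " word attributes generated");
    }
}
```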

I'd like to ask a few questions about this process.

1) How many documents should I use as the training set, so that over-fitting is avoided?

2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?

3) As the classification method I usually choose NaiveBayes, but the results I get are the following:

-------------------------
Correctly Classified Instances         393               70.0535 %
Incorrectly Classified Instances       168               29.9465 %
Kappa statistic                          0.415 
Mean absolute error                      0.2943
Root mean squared error                  0.5117
Relative absolute error                 60.9082 %
Root relative squared error            104.1148 %
----------------------------

and if I use SMO the results are:

------------------------------
Correctly Classified Instances         418               74.5098 %
Incorrectly Classified Instances       143               25.4902 %
Kappa statistic                          0.4742
Mean absolute error                      0.2549
Root mean squared error                  0.5049
Relative absolute error                 52.7508 %
Root relative squared error            102.7203 %
Total Number of Instances              561     
------------------------------

So in document classification, which one is the "better" classifier? Which one is better for small data sets like mine? I've read that NaiveBayes performs better with big data sets, but if I increase my data set, will it cause over-fitting? Also, about the Kappa statistic: is there an accepted threshold, or does it not matter in this case because there are only two categories?

Sorry for the long post, but I've been trying for a week to improve the classification results without success, although I tried to collect documents that fit better into each category.

Upvotes: 1

Views: 353

Answers (3)

G.Ahmed

Reputation: 146

Regarding the second question, "Do I have to remove any of the words to get a better result from the classifier, or does it not play any role?":

I was building a classifier and training it with the famous 20 Newsgroups dataset. When I tested it without preprocessing, the results were not good, so I pre-processed the data according to the following steps:

  1. Substitute TAB, NEWLINE and RETURN characters with SPACE.
  2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
  3. Turn all letters to lowercase.
  4. Substitute multiple SPACES with a single SPACE.
  5. Add the title/subject of each document to the beginning of the document's text.
  6. no-short: obtained from the previous step by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
  7. no-stop: obtained from the previous step by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
  8. stemmed: obtained from the previous step by applying Porter's stemmer to the remaining words.

These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/
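Steps 1-4 and 6 of that pipeline can be sketched in plain Java (the class name Preprocess and the exact regular expressions are my own choices, not from the dataset page; the length cutoff keeps words of 3 or more characters, matching the "he" vs "him" example):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class Preprocess {
    /** Applies steps 1-4 and 6: whitespace normalization, letters only,
     *  lowercase, space collapsing, and removal of words under 3 characters. */
    public static String clean(String text) {
        String s = text.replaceAll("[\\t\\n\\r]", " ")  // 1. TAB/NEWLINE/RETURN -> SPACE
                       .replaceAll("[^A-Za-z ]", " ")   // 2. keep only letters
                       .toLowerCase()                   // 3. lowercase
                       .replaceAll(" +", " ")           // 4. collapse multiple spaces
                       .trim();
        return Arrays.stream(s.split(" "))              // 6. drop words shorter than 3 chars
                     .filter(w -> w.length() >= 3)
                     .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // -> "said times the quick fox" ("he" and the digits are removed)
        System.out.println(clean("He said: 42 times,\nthe QUICK fox!"));
    }
}
```

Stopword removal (step 7) and stemming (step 8) would then run on this cleaned text, e.g. via Weka's own stopword and stemmer options in StringToWordVector.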

Upvotes: 0

ylqfp
ylqfp

Reputation: 21

1) How many documents shall I use as the training set, so that over-fitting is avoided?

You don't need to choose the size of a separate training set; in WEKA, you can just use 10-fold cross-validation. Back to the question: the choice of machine learning algorithm influences over-fitting much more than the size of the data set does.


2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?

It definitely plays a role, but a better result cannot be promised.


3) As the classification method I usually choose NaiveBayes, but the results I get are the following:

Usually, to judge whether a classification algorithm is good or not, the ROC/AUC/F-measure values are considered the most important indicators. You can learn about them in any machine learning book.

Upvotes: 2

Sicco

Reputation: 6271

To answer your questions:

  1. I would use (10-fold) cross-validation to evaluate your method. The model is trained 10 times on 90% of the data and tested on the remaining 10%, using different parts of the data each time. The results are therefore less biased towards your current (random) selection of training and test sets.
  2. Removing stop words (i.e., frequently occurring words with little discriminating value, like "the", "he" or "and") is a common strategy to improve your classifier. Weka's StringToWordVector allows you to select a file containing these stop words, but it should also have a default list of English stop words.
  3. Given your results, SMO performs the best of the two classifiers (e.g., it has more Correctly Classified Instances). You might also want to take a look at (Lib)SVM or LibLinear, which can also perform quite well on document classification. You may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing easy installation.
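The 10-fold cross-validation comparison from point 1 can be sketched with Weka's Java API. Evaluation, crossValidateModel, pctCorrect, kappa, and areaUnderROC are real Weka methods; the file name vectors.arff (already-filtered data) and the class-index choice are placeholder assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Placeholder: data already passed through StringToWordVector
        Instances data = DataSource.read("vectors.arff");
        data.setClassIndex(data.numAttributes() - 1); // adjust to your class attribute

        // 10-fold cross-validation for each classifier, same random seed
        Evaluation nbEval = new Evaluation(data);
        nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        Evaluation smoEval = new Evaluation(data);
        smoEval.crossValidateModel(new SMO(), data, 10, new Random(1));

        System.out.printf("NaiveBayes: %.2f%% correct, kappa %.3f%n",
                nbEval.pctCorrect(), nbEval.kappa());
        System.out.printf("SMO:        %.2f%% correct, kappa %.3f%n",
                smoEval.pctCorrect(), smoEval.kappa());
        // AUC for the first class value, the indicator ylqfp recommends
        System.out.println("NaiveBayes AUC: " + nbEval.areaUnderROC(0));
    }
}
```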

Upvotes: 1
