Doron Cohen

Reputation: 245

Large number of attributes: best classifiers

I have a dataset built from 940 attributes and 450 instances, and I'm trying to find the classifier that gets the best results. I have used every classifier that WEKA suggests (such as J48, CostSensitiveClassifier, combinations of several classifiers, etc.). The best solution I have found is a J48 tree with an accuracy of 91.7778%, and the confusion matrix is:

394  27 |   a = NON_C
 10  19 |   b = C

I want to get better results in the confusion matrix, with at least 90% accuracy for each class (both TN and TP). Is there something I can do to improve this (such as long-running classifiers that scan all options)? Any other ideas I didn't think of? Here is the file:

https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/

Please help!!

Upvotes: 0

Views: 1310

Answers (2)

aplassard

Reputation: 759

There are quite a few things you can do to improve the classification results.

First, it seems that your training data is severely imbalanced. Training with that imbalance creates a significant bias in almost any classification algorithm.
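
For example, a minimal sketch using WEKA's Java API (the ARFF path is a placeholder; SpreadSubsample is one of WEKA's standard supervised instance filters, which downsamples the majority class):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.SpreadSubsample;

    public class Rebalance {
        public static void main(String[] args) throws Exception {
            // Load the data and set the class attribute (path is a placeholder)
            Instances data = DataSource.read("dataset.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Downsample the majority class to a uniform class distribution
            SpreadSubsample balance = new SpreadSubsample();
            balance.setDistributionSpread(1.0); // 1.0 = equal counts per class
            balance.setInputFormat(data);
            Instances balanced = Filter.useFilter(data, balance);

            System.out.println("Before: " + data.numInstances()
                    + " instances, after: " + balanced.numInstances());
        }
    }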

Second, you have far more features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
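
One way to get an L2 penalty without leaving WEKA is its built-in Logistic classifier, which exposes a ridge parameter (L1 would require an add-on such as LibLINEAR, not shown). A minimal sketch, with the path and ridge value as placeholders to tune:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RidgeLogistic {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // Logistic regression with an L2 (ridge) penalty; larger values
            // shrink the coefficients harder, which helps when features >> examples
            Logistic logistic = new Logistic();
            logistic.setRidge(1.0); // placeholder value, tune by cross-validation

            // Evaluate on held-out folds, not on the training data
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(logistic, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }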

Third, consider projecting your data into a lower-dimensional PCA space, say one covering 90% of the variance. This will remove much of the noise in the training data.
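
WEKA ships a PrincipalComponents filter that can keep just enough components to cover a chosen fraction of the variance; a minimal sketch (path is a placeholder):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.PrincipalComponents;

    public class PcaProjection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // Keep enough principal components to cover 90% of the variance
            PrincipalComponents pca = new PrincipalComponents();
            pca.setVarianceCovered(0.9);
            pca.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, pca);

            System.out.println("Attributes before: " + data.numAttributes()
                    + ", after PCA: " + reduced.numAttributes());
        }
    }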

Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no-no.
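
The simplest fix is to let WEKA run a stratified 10-fold cross-validation, so every instance is predicted by a model that never saw it during training; a minimal sketch with J48 (path is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidateJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation: accuracy and confusion matrix come
            // from held-out folds rather than from the training set
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }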

Upvotes: 1

xhudik

Reputation: 2442

I'd guess that you got a data set and just tried all possible algorithms...

Usually, it is good to think about the problem first:

  1. Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C); see the sketch after this list.

  2. Your dataset is imbalanced, i.e. the number of NON_C examples is much higher than C. Sometimes it helps to train your algorithm on equal portions of positive and negative (in your case C and NON_C) examples, and cross-validate it on the natural (real) proportions.

  3. The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help...

    ...
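
For point 1, a minimal sketch of supervised attribute selection in WEKA's Java API (CFS scores attribute subsets by their correlation with the class while penalizing redundancy among attributes; the path is a placeholder):

    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    public class SelectFeatures {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // Keep only attributes that correlate with the class (NON_C, C)
            AttributeSelection select = new AttributeSelection();
            select.setEvaluator(new CfsSubsetEval());
            select.setSearch(new BestFirst());
            select.setInputFormat(data);
            Instances selected = Filter.useFilter(data, select);

            System.out.println("Attributes before: " + data.numAttributes()
                    + ", after selection: " + selected.numAttributes());
        }
    }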

Upvotes: 1
