Reputation: 173
I have a dataset with 239 positive and 32 negative instances. Since it is cancer-related data, we had only a few negative samples. When I apply classification, the imbalanced dataset will surely be biased heavily towards the positive class because of its much larger size. So I tried applying SMOTE in Weka, with various percentages and numbers of nearest neighbours. To my surprise, the negative class gained only a few instances while the positive class increased further, making the dataset even more imbalanced. What can be done to overcome this? Please also suggest other methods, if any are available.
For our initial studies we used LIBSVM with an RBF kernel as the classifier.
Upvotes: 2
Views: 4660
Reputation: 1061
For this imbalanced dataset problem, I suggest using stratification, which involves over-sampling the minority class or down-sampling the majority class. You can simulate stratification in WEKA by using cost-sensitive classification.
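To make the two resampling options concrete, here is a minimal sketch in plain Python rather than WEKA. The function names and the placeholder instances are made up for illustration; only the class sizes (239 and 32) come from the question. It uses simple random resampling, not SMOTE.

```python
import random

def oversample_minority(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def downsample_majority(majority, minority, seed=0):
    """Keep a random subset of the majority class, as large as the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Placeholder instances; real data would be feature vectors.
positives = [("pos", i) for i in range(239)]
negatives = [("neg", i) for i in range(32)]

over_pos, over_neg = oversample_minority(positives, negatives)
down_pos, down_neg = downsample_majority(positives, negatives)

print(len(over_pos), len(over_neg))
print(len(down_pos), len(down_neg))
```

Whichever variant you try, resample only the training portion of each split so that duplicated instances do not leak into the test data.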
You can use either of two meta-classifiers, MetaCost and CostSensitiveClassifier. The only issue is that the optimal values in the cost matrix can only be found by experimenting. As a rule of thumb, you can try to balance the class distribution by using costs that are inversely proportional to the class frequencies. In your case, this means assigning a cost of 239 to false positives and a cost of 32 to false negatives in the cost matrix.
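As a rough illustration of what those costs do, the sketch below builds the matrix in plain Python and picks the prediction with the lower expected cost. The `cost[actual][predicted]` layout and the function name are assumptions made for this example; check the row/column orientation and class order WEKA expects before copying the numbers into its cost matrix editor.

```python
n_pos, n_neg = 239, 32  # class sizes from the question

# cost[actual][predicted]; correct predictions cost nothing (assumed layout).
cost = {
    "neg": {"neg": 0, "pos": n_pos},  # false positive: a rare negative is missed
    "pos": {"neg": n_neg, "pos": 0},  # false negative
}

def min_cost_prediction(p_pos):
    """Return the class with the lower expected cost, given P(positive)."""
    p_neg = 1.0 - p_pos
    expected = {
        "pos": p_neg * cost["neg"]["pos"],  # cost incurred if we predict positive
        "neg": p_pos * cost["pos"]["neg"],  # cost incurred if we predict negative
    }
    return min(expected, key=expected.get)

# Probability above which predicting "pos" becomes the cheaper choice.
threshold = cost["neg"]["pos"] / (cost["neg"]["pos"] + cost["pos"]["neg"])

print(round(threshold, 3))
print(min_cost_prediction(0.90), min_cost_prediction(0.85))
```

With these costs the break-even probability equals the positive class prior (239/271), so the classifier has to be more confident than the base rate before it predicts the majority class.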
Upvotes: 2