Reputation: 33
I have just oversampled my dataset using SMOTE, from the DMwR package.
My dataset has two classes. The original class distribution is 12 vs 62, so I ran this oversampling:
newData <- SMOTE(Score ~ ., data, k = 3, perc.over = 400, perc.under = 150)
Now, the distribution is 60 vs 72.
However, when I display the 'newData' dataset, I see that SMOTE has oversampled and some samples are repeated.
For example, sample number 24 appears as 24.1, 24.2 and 24.3.
Is this correct? It directly affects classification, because the classifier will learn a model from data that will also be present in the test set, which is not valid.
Edit: I don't think I explained my issue correctly:
As you know, SMOTE is an oversampling technique. It creates new samples from the original ones by modifying their feature values. However, when I display the new data generated by SMOTE, I get this:
(these are the feature values)
Sample 50: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645 0.008043167
Sample 50.1: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645
Sample 50 belongs to the original dataset. Sample 50.1 is the 'artificial' sample generated by SMOTE. However (and this is my issue), SMOTE has created a repeated sample instead of creating an artificial one by slightly modifying the feature values.
I hope you can understand me.
Thanks!
Upvotes: 2
Views: 6541
Reputation: 169
SMOTE is a very simple algorithm for generating synthetic samples. However, before you go ahead and use it, you have to understand your features; for example, whether each of them should vary on the same scale.
Simply put, before you use SMOTE, try to understand your data!
Upvotes: 0
Reputation: 81
SMOTE is an algorithm that generates synthetic examples of a given class (the minority class) to handle imbalanced distributions. This strategy for generating new data is then combined with random under-sampling of the majority class. When you use SMOTE in the DMwR package, you need to specify both an over-sampling percentage and an under-sampling percentage. These values must be set carefully, because the resulting class distribution may still be imbalanced.
In your case, given the parameters you set, namely the under- and over-sampling percentages, SMOTE will introduce replicas of examples from your majority class.
Your initial class distribution is 12 to 62, and after applying SMOTE you end up with 60 to 72. This means that the minority class was oversampled and new synthetic examples of this class were produced.
However, your majority class, which had 62 examples, now contains 72! The under-sampling percentage was applied to this class, yet it actually increased the number of examples. Since the number of examples to select from the majority class is determined from the number of synthetic minority examples, the number sampled from this class was larger than the number already there.
Therefore, you had 62 examples and the algorithm tried to randomly select 72! This means that some replicas of majority-class examples were introduced.
So, to work through the over- and under-sampling percentages you selected:
12 examples in the minority class with 400% over-sampling gives 12*400/100 = 48. So 48 new synthetic examples were added to the minority class (12 + 48 = 60, the final number of minority examples).
The number of examples to select from the majority class is 48*150/100 = 72. But the majority class only has 62 examples, so replicas are necessarily introduced.
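This arithmetic can be sketched as follows (a minimal illustration in Python, since the calculation is language-agnostic; the variable names are mine, not DMwR's):

```python
# Class counts from the question: 12 minority vs 62 majority examples.
n_minority = 12
n_majority = 62

perc_over = 400   # perc.over: % of new synthetic minority examples
perc_under = 150  # perc.under: % of majority examples selected,
                  # relative to the number of synthetic examples created

# Over-sampling: 400% of 12 = 48 synthetic minority examples.
n_synthetic = n_minority * perc_over // 100       # 48
minority_after = n_minority + n_synthetic         # 12 + 48 = 60

# Under-sampling: 150% of 48 = 72 majority examples are selected.
majority_after = n_synthetic * perc_under // 100  # 72

# 72 > 62, so at least 10 majority examples must appear more than once.
n_replicas = max(0, majority_after - n_majority)  # 10

print(minority_after, majority_after, n_replicas)
```

This reproduces the 60 vs 72 distribution in the question and shows that at least 10 majority-class replicas are unavoidable with these percentages.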
Upvotes: 5
Reputation: 385
I'm not sure about the implementation of SMOTE in DMwR, but it should be safe for you to round the new data to the nearest integer value. One guess is that this is left to you on the off chance that you want to do regression instead of classification: if you wanted regression and SMOTE returned integers, you would have unintentionally lost information going in the opposite direction (SMOTE -> integers -> reals).
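If you go the rounding route, it is a one-liner on whichever columns are integer-coded (a sketch in Python; `feature_values` is a made-up stand-in for one synthetic row, not anything DMwR produces):

```python
# Hypothetical synthetic feature values for integer-coded features.
feature_values = [1.8787547, 0.19847987, 4.420207]

# Snap each value back to the nearest integer.
rounded = [round(v) for v in feature_values]  # [2, 0, 4]
```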
If you are not familiar with what SMOTE does: it creates 'new data' by looking at nearest neighbors to establish a neighborhood and then sampling from within that neighborhood. It is usually done when there is insufficient data for a given class in a classification problem. It operates on the assumption that data near your data is similar because of proximity.
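The core mechanism can be sketched roughly like this (a toy interpolation between one example and a neighbor; this is not the DMwR implementation, and in practice the neighbor comes from a k-NN search over the minority class):

```python
import random

def smote_like_sample(x, neighbor):
    """Generate one synthetic point on the segment between x and a
    nearest neighbor: x + gap * (neighbor - x), with gap in [0, 1]."""
    gap = random.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

# Toy minority examples with two features each.
x = [1.0, 2.0]
nb = [3.0, 4.0]
synthetic = smote_like_sample(x, nb)
# Each coordinate of the synthetic point lies between x and nb.
```

Note that if the chosen neighbor happens to coincide with x (e.g. when the minority class contains duplicates), the 'synthetic' point is an exact copy, which is one plausible way repeated samples like Sample 50.1 can arise.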
Alternatively, you can use Weka's implementation of SMOTE, which does not make you do this additional work.
Upvotes: 0