Reputation: 1
I'm trying out some samples using existing SMOTE packages. I tried the performanceEstimation
package and followed its sample code for SMOTE. The code is below for reference:
library(performanceEstimation)

## A small example with a data set created artificially from the IRIS
## data
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(data$Species)
## now using SMOTE to create a more "balanced problem"
newData <- smote(Species ~ ., data, perc.over = 6, perc.under = 1)
table(newData$Species)
## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[,3]),
main = "SMOTE'd Data")
## End(Not run)
Link: https://cran.r-project.org/web/packages/performanceEstimation/performanceEstimation.pdf
In the result for newData, I noticed that the majority-class samples are generated as duplicates. Here is what the results look like:
.. | Sepal.Length | Sepal.Width | Species |
---|---|---|---|
146 | 6.7 | 3.0 | common |
146.1 | 6.7 | 3.0 | common |
106 | 7.6 | 3.0 | common |
60 | 5.2 | 2.7 | common |
107 | 4.9 | 2.5 | common |
107.1 | 4.9 | 2.5 | common |
107.2 | 4.9 | 2.5 | common |
The first column is the row index that you see when you print newData or click on the newData variable in the Environment tab in RStudio. Note that the table above is only a snippet I picked from the result; "common" is the majority class created from the iris dataset.
So the question(s) here is:
As I understand it, SMOTE undersamples the majority class and oversamples the minority class, and only the oversampling part generates synthetic samples. Yet the last 3 rows in the table above are duplicates, and they belong to the majority class.
If you run the code, you will see rows indexed with decimals. I've searched around but couldn't find a similar question in any of the forums. I've also tried other packages and obtained similar results.
Upvotes: 0
Views: 789
Reputation: 78
To your first question: if your goal is a balanced dataset, you do not have to undersample the majority class. For example, if class 1 has 100 cases and class 2 has 50 (as in your iris example), you can oversample only class 2 from 50 to 100 and keep class 1 at 100. Using performanceEstimation::smote you can do:
newData <- performanceEstimation::smote(Species ~ ., data, perc.over = 1, perc.under = 2, k = 10)
table(newData$Species)
This results in 100 cases of the rare class and 100 of the common class.
Whether you balance by undersampling, oversampling, or both, balancing generally leads to better results if you are mainly interested in the minority class (fraud detection, outlier detection, drug side effects in patients, etc.). Otherwise your model is biased towards the majority class, which leads to low accuracy and precision for the minority class you are interested in.
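The class counts, and the duplicated rows from the question, can be reproduced with simple arithmetic in base R. This is only a sketch: `expected_counts` is a hypothetical helper, and it assumes (for perc.over >= 1) that perc.over synthetic cases are generated per minority case and that perc.under majority cases are drawn, with replacement, per synthetic case.

```r
# Hypothetical helper, not part of any package. Assumes perc.over >= 1,
# perc.over synthetic cases per minority case, and perc.under majority
# cases drawn per synthetic case.
expected_counts <- function(n_minority, perc.over, perc.under) {
  n_synthetic <- perc.over * n_minority
  c(minority = n_minority + n_synthetic,
    majority = perc.under * n_synthetic)
}

counts_question <- expected_counts(50, perc.over = 6, perc.under = 1)  # 350 / 300
counts_answer   <- expected_counts(50, perc.over = 1, perc.under = 2)  # 100 / 100

# Only 100 "common" rows exist, so drawing 300 of them must repeat rows.
common <- iris[iris$Species != "setosa", c(1, 2, 5)]
set.seed(42)
picked <- common[sample(nrow(common), 300, replace = TRUE), ]

# Strip the ".1", ".2", ... suffixes to count the distinct source rows.
n_distinct <- length(unique(sub("\\.[0-9]+$", "", rownames(picked))))

# R keeps row names unique, so a repeated row gets a decimal-looking suffix.
repeated <- common[c("107", "107", "107"), ]
rownames(repeated)  # "107" "107.1" "107.2"
```

Under these assumptions, the call in the question asks for 300 majority rows from a pool of 100, so repeated rows with names such as 107.1 and 107.2 are expected rather than a bug.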
To your second question: note the k parameter, which I took from the package documentation. k determines how many nearest neighbours are used for the interpolation. If you look into newData, you will notice that many of the oversampled cases of the MINORITY (rare) class are interpolated. The majority (common) class rows you showed in your table are not interpolated; they are duplicates because that class is sampled with replacement. The size of the neighbourhood used for the interpolation is determined by k:
301 4.660510 3.278980 rare
311 4.757001 3.142999 rare
321 5.432725 3.432725 rare
You can tell these rows are interpolated because the values have many decimal places, whereas the original dataset is recorded to one decimal place. I suggest you play around with the parameter k and inspect the plot (as you already did) before and after smote.
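The interpolation itself can be sketched in a few lines of base R. This is a generic illustration of the SMOTE idea (pick one of the k nearest minority neighbours and move a random fraction of the way towards it), not the package's actual implementation; all variable names here are made up for the example.

```r
set.seed(1)

# Minority class only: the setosa rows, using the two numeric columns.
x <- as.matrix(iris[iris$Species == "setosa", 1:2])
k <- 5   # number of nearest neighbours to choose from
i <- 1   # index of the minority case we generate a synthetic case from

# Euclidean distance from case i to every minority case.
d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))

# The k nearest neighbours of case i, excluding case i itself.
others <- setdiff(seq_len(nrow(x)), i)
nn <- others[order(d[others])[1:k]]

# Pick one neighbour and a random position on the segment between the two.
j <- nn[sample.int(k, 1)]
gap <- runif(1)
synthetic <- x[i, ] + gap * (x[j, ] - x[i, ])

synthetic
```

A larger k lets the synthetic case be pulled towards more distant minority cases, which is why changing k changes how spread out the generated points look in the plot.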
One note: the other smote package, DMwR::SMOTE, has the same parameters and functional structure, shares the same logic for the perc.over and perc.under parameters, and leads to almost the same result. Here is the same example, which should again lead to a balanced iris dataset of 100/100:
newData <- DMwR::SMOTE(Species ~ ., data, perc.over = 100, perc.under = 200, k = 10)
table(newData$Species)
Note that perc.over and perc.under follow the same logic here, but they are interpreted as percentages: 100 and 200 in DMwR mean 100% and 200%, which corresponds to 1 and 2 in the performanceEstimation package.
One final note on the smotefamily package. I can only recommend it if you want to do dbsmote and adasyn, which work without hassle. I cannot recommend smotefamily::SMOTE, because the syntax is really a pain. It requires a "numeric-attributed" dataframe or matrix. The example in the documentation uses a data generator that "magically" produces the correct object but leaves the reader alone in reproducing it. That is just a side note, to have all three packages (DMwR, performanceEstimation, and smotefamily) in one comment :)
Upvotes: 0