SMOTE generates duplicate samples for the majority class

Question

I'm trying out some existing SMOTE packages. I started with the performanceEstimation package and followed its sample code for SMOTE. The code is below for reference:

library(performanceEstimation)

## A small example with a data set created artificially from the IRIS
## data

data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

## checking the class distribution of this artificial data set
table(data$Species)

## now using SMOTE to create a more "balanced problem"
newData <- smote(Species ~ ., data, perc.over = 6, perc.under = 1)
table(newData$Species)


## Checking visually the created data
## Not run:
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
     main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[, 3]),
     main = "SMOTE'd Data")
## End(Not run)

Link: https://cran.r-project.org/web/packages/performanceEstimation/performanceEstimation.pdf

In the resulting newData, I noticed that some majority-class samples appear as duplicates. Here is what the result looks like:

      Sepal.Length Sepal.Width Species
146            6.7         3.0  common
146.1          6.7         3.0  common
106            7.6         3.0  common
60             5.2         2.7  common
107            4.9         2.5  common
107.1          4.9         2.5  common
107.2          4.9         2.5  common

The first column is the row name that you see when you print newData or click on the newData variable in the Environment tab in RStudio. Note that the table above is only a snippet I picked from the result. "common" is the majority class in this relabelled iris data set.

So my questions are:

Why does SMOTE generate duplicate samples for the majority (common) class?
Will these duplicate samples affect the accuracy of the classification model?

As I understand it, SMOTE undersamples the majority class and oversamples the minority class, and only the oversampling step generates synthetic samples. Yet the last 3 rows in the table above (107, 107.1 and 107.2) are exact duplicates, as are rows 146 and 146.1.

If you run the code, you will see rows whose row names carry decimal suffixes. I've searched around, but I couldn't find a similar question in any of the forums. I've also tried other packages and obtained similar results.
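For what it's worth, the decimal row names can be reproduced in base R without any SMOTE package. The sketch below indexes the same relabelled data with a hand-picked vector of repeated row numbers (idx is hypothetical, chosen only to mirror the snippet above). This makes me suspect, as an assumption I have not confirmed from the package source, that the majority rows are drawn with replacement during undersampling.

```r
## Base R only: no SMOTE package involved.
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

## Hypothetical row picks that repeat some majority-class rows,
## mimicking a draw with replacement.
idx <- c(146, 146, 106, 60, 107, 107, 107)
picked <- data[idx, ]

## R makes repeated row names unique by appending .1, .2, ...
rownames(picked)

## Flag every row that has an identical copy elsewhere in the result.
dup <- duplicated(picked) | duplicated(picked, fromLast = TRUE)
picked[dup, ]
```

Running the same duplicated() check on newData lists all of the repeated rows at once instead of spotting them by eye.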
