Reputation: 482
I have a data set with around 130,000 records. The records are divided into two classes of the target variable, 0 & 1, and class 1 makes up only 0.09% of the total.
I'm running my analysis in R 3.5.1 on Windows 10 and used the SMOTE algorithm to work with this imbalanced data set.
I used the following code to handle the imbalanced data:
library(DMwR)
data_code$target <- as.factor(data_code$target)  # converted to factor, as SMOTE works with a factor target
smoted_data <- SMOTE(target ~ ., data_code, perc.over = 100)
But after executing the code, I see that the count for class 0 is 212 and the count for class 1 is also 212, which is a significant reduction in my sample size. Can you suggest how I can handle this imbalanced data set with SMOTE without reducing my data size?
Upvotes: 3
Views: 4409
Reputation: 117
I know I'm a little too late to answer your question, but I hope this answer will help others! The package you're using, DMwR, uses a combination of SMOTE and under-sampling of the majority class.
I'd suggest using smotefamily::SMOTE instead, as it only over-samples the minority class, so you wouldn't lose your majority class observations.
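A minimal sketch of that approach (using the asker's data_code and target column from the question; smotefamily expects the features and the labels as separate objects, and the feature columns are assumed to be numeric):
library(smotefamily)

# Split the numeric features from the class labels
features <- data_code[, setdiff(names(data_code), "target")]
labels   <- data_code$target

# dup_size = 0 lets SMOTE choose how many synthetic minority rows to create
smoted <- SMOTE(features, labels, K = 5, dup_size = 0)

table(smoted$data$class)  # the majority class count is unchanged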
Upvotes: 0
Reputation: 7151
An alternative to the DMwR package is the smotefamily package, which does not reduce the sample size.
Instead, it creates additional (synthesized) data from the minority class and adds it to the original data, so the $data element of the output is ready for training. To tune the amount of synthesized data, you can modify the parameter dup_size. However, the default dup_size = 0 already sizes the output to achieve balanced classes, so you don't need to tune it.
This is explained well in a blog post by Richard Richard.
Example code (with features in first two columns):
library(magrittr)  # provides %>%
library(caret)     # provides train() and confusionMatrix()

# 'features' (the numeric feature columns) and 'target' (the class labels) are assumed to exist
smote1 <- smotefamily::SMOTE(features, target, K = 4, dup_size = 0)
smote1$data$class <- as.factor(smote1$data$class)  # caret::train needs a factor outcome
formula1 <- "class ~ ." %>% as.formula
model.smote <- caret::train(formula1, data = smote1$data, method = "rpart")
predictions.smote <- predict(model.smote, smote1$data[, 1:2]) %>% print
cv2 <- confusionMatrix(smote1$data$class, predictions.smote)
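To confirm that the majority class was kept intact, a quick check with the same objects:
table(target)              # class counts in the original data
table(smote1$data$class)   # only the minority class should have grown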
I find smotefamily::SMOTE more convenient because you don't have to tune the two parameters perc.over and perc.under until you get an acceptable sample size, and DMwR::SMOTE often generates NA values.
Upvotes: 0
Reputation: 8364
You need to play a bit with the two parameters available in the function: perc.over and perc.under.
As per the documentation of DMwR::SMOTE:
The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively.
So:
perc.over will typically be a number above 100. With this type of values, for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created
I can't see your data but, if your minority class has 100 cases and perc.over = 100, the algorithm will generate 100/100 = 1 new case for each of them, i.e. 100 new synthetic cases for that class.
The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases.
So, for example, a value of perc.under = 100 will select from the majority class of the original data the same number of observations as were generated for the minority class.
In our example 100 new minority cases were generated, so perc.under = 100 would keep 100 majority cases, and the resulting data set would have 100 original minority + 100 synthetic minority + 100 majority cases (300 rows in total).
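A quick back-of-the-envelope check of those counts (plain R arithmetic, not a package call):
n_min <- 100                  # minority cases in the original data
n_new <- n_min * 100 / 100    # perc.over  = 100 -> 100 synthetic minority cases
n_maj <- n_new * 100 / 100    # perc.under = 100 -> 100 majority cases kept
n_min + n_new + n_maj         # final data set: 300 rows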
I suggest using values above 100 for perc.over, and an even higher value for perc.under (the defaults are 100 and 200).
Keep in mind that you're adding observations to your minority class that are not real, so I'd try to keep these under control.
Numeric example:
set.seed(123)
data <- data.frame(var1 = sample(50),
                   var2 = sample(50),
                   out  = as.factor(rbinom(50, 1, prob = 0.1)))
table(data$out)
# 0 1
# 43 7 # 50 rows total (original data)
smote_data <- DMwR::SMOTE(out ~ var1, data, perc.over = 200, perc.under = 400)
table(smote_data$out)
# 0 1
# 56 21 # 77 rows total (smote data)
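These counts line up with the formulas above; a quick sanity check of the arithmetic:
7 + 7 * 200/100           # minority class after SMOTE: 7 original + 14 synthetic = 21
(7 * 200/100) * 400/100   # majority class kept: 4 * 14 = 56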
Upvotes: 4