Reputation: 43
The target attribute distribution is currently like this:
mydata.groupBy("Churn").count().show()
+-----+-----+
|Churn|count|
+-----+-----+
| 1| 483|
| 0| 2850|
+-----+-----+
My questions are:
Do oversampling methods like manual duplication, SMOTE, or ADASYN use the available data to create new data points?
If we use such data to train a classification model, will it not be an overfitted one?
Upvotes: 2
Views: 1907
Reputation: 808
My question is: will any oversampling method (manual, SMOTE, ADASYN) use the available data to create new data points?
Yes. SMOTE (Synthetic Minority Over-sampling TEchnique) is a synthetic oversampling method: it creates new minority-class samples from the existing data rather than simply duplicating rows.
The process in SMOTE is roughly:
1. Pick a sample from the minority class.
2. Find its k nearest minority-class neighbours.
3. Randomly choose one of those neighbours and create a synthetic sample at a random point on the line segment between the two.
4. Repeat until the desired class balance is reached.
So, this is a bit smarter than just over-sampling.
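The interpolation idea can be sketched in plain NumPy. This is an illustrative toy, not the imbalanced-learn implementation; the 483/2850 counts are taken from the question above:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbours."""
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]         # indices of k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = nn[i, rng.integers(min(k, n - 1))]  # pick one of its neighbours
        lam = rng.random()                    # random point on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 10 points in 2-D
X_min = rng.normal(size=(10, 2))
X_new = smote_sketch(X_min, n_new=2850 - 483)  # bring 483 up to 2850
print(X_new.shape)  # (2367, 2)
```

In practice you would use a tested library (e.g. imbalanced-learn's `SMOTE`) rather than hand-rolling this, but the core step is exactly the interpolation shown.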
If we use such data to build a classification model, will it not be an overfitted one?
The correct answer would be PROBABLY. Give it a try!
This is why we use test sets and cross-validation: to estimate whether the model will still perform well on unseen data!
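One practical rule that keeps the estimate honest: split first, then oversample only the training portion, so the test set contains only real, unseen rows. A minimal sketch (random duplication stands in for SMOTE; the class counts mirror the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy imbalanced labels mirroring the question: 483 ones, 2850 zeros
y = np.array([1] * 483 + [0] * 2850)
idx = rng.permutation(len(y))

# 1) split FIRST, so the test set never contains oversampled copies
test_idx, train_idx = idx[:666], idx[666:]

# 2) oversample only the training portion (random duplication here)
train_pos = train_idx[y[train_idx] == 1]
train_neg = train_idx[y[train_idx] == 0]
extra = rng.choice(train_pos, size=len(train_neg) - len(train_pos), replace=True)
train_bal = np.concatenate([train_idx, extra])

print(np.bincount(y[train_bal]))             # balanced training labels
print(np.intersect1d(test_idx, extra).size)  # 0: no leakage into the test set
```

If you oversample before splitting, copies of the same minority row can land in both train and test, and the test score will look better than the model really is.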
Upvotes: 1