aquarian47

Reputation: 43

Will oversampling lead to an overfitted model?

The target attribute distribution is currently like this:

mydata.groupBy("Churn").count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|    1|  483|
|    0| 2850|
+-----+-----+

My questions are:

Upvotes: 2

Views: 1907

Answers (1)

Francesco Pegoraro

Reputation: 808

My question is: any method of oversampling (manual, SMOTE, ADASYN) will use the available data to create new data points.

  • Data imbalance problems are mostly handled in three ways:
    1. Over-sample the minority class.
    2. Under-sample the majority class.
    3. Synthesize new minority-class samples.
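The first two options can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`split_by_class`, `random_oversample`, `random_undersample`) and the toy rows are hypothetical, and only the class counts (483 vs. 2850) come from the question.

```python
import random

def split_by_class(rows, label_of, minority):
    """Separate rows into (minority_rows, majority_rows)."""
    minority_rows = [r for r in rows if label_of(r) == minority]
    majority_rows = [r for r in rows if label_of(r) != minority]
    return minority_rows, majority_rows

def random_oversample(rows, label_of, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_class(rows, label_of, minority)
    n_extra = max(0, len(majority_rows) - len(minority_rows))
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return majority_rows + minority_rows + extra

def random_undersample(rows, label_of, minority, seed=0):
    """Keep a random subset of majority rows, as many as there are minority rows."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_class(rows, label_of, minority)
    n_keep = min(len(majority_rows), len(minority_rows))
    return rng.sample(majority_rows, n_keep) + minority_rows

def churn(row):
    return row["Churn"]

# Toy rows with the same class counts as in the question (hypothetical ids).
rows = [{"id": i, "Churn": 1 if i < 483 else 0} for i in range(3333)]

over = random_oversample(rows, churn, minority=1)
under = random_undersample(rows, churn, minority=1)
print(len(over), len(under))  # 5700 966
```

Note that the oversampled set still contains only 483 distinct churners; the rest are exact copies, which is why a model can memorize them.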

SMOTE (Synthetic Minority Over-sampling TEchnique) falls under the third of these. It creates new, synthetic minority-class samples from the existing data.

The process in SMOTE is shown below:

[Image: the SMOTE process]

So, this is a bit smarter than plain over-sampling, which only duplicates existing rows.
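To make the idea concrete, here is a minimal sketch of the interpolation step behind SMOTE. It is not the reference implementation: `smote_sketch` and its parameters are illustrative names, it assumes purely numeric features and at least two minority points, and it uses a brute-force neighbour search. In practice a library implementation (for example `SMOTE` in imbalanced-learn) is the safer choice.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of numeric minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority points: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=10, k=2)
print(len(new_points))  # 10
```

Because every synthetic point is built from two existing minority points, no information from outside the original data is added; the new points only fill in the space between neighbours.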

If we use such data to build a classification model, will it not be an overfitted one?

The correct answer would be PROBABLY. Give it a try!

This is why we use test sets and cross-validation: to check whether the model performs well on unseen data!
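One practical point when doing that check: split first, then oversample only the training part, so the test set keeps the original class balance and contains no copies of training rows. A minimal sketch in plain Python, with hypothetical helper names (`stratified_split`, `oversample_minority`) and toy rows that reuse the class counts from the question:

```python
import random

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label_of(r), []).append(r)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's ordering is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def oversample_minority(rows, label_of, minority, seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if label_of(r) == minority]
    majority_rows = [r for r in rows if label_of(r) != minority]
    n_extra = max(0, len(majority_rows) - len(minority_rows))
    return rows + [rng.choice(minority_rows) for _ in range(n_extra)]

def churn(row):
    return row["Churn"]

# Toy rows with the class counts from the question (hypothetical ids).
rows = [{"id": i, "Churn": 1 if i < 483 else 0} for i in range(3333)]

train, test = stratified_split(rows, churn)
balanced_train = oversample_minority(train, churn, minority=1)

# Fit the model on balanced_train; evaluate on test, which was never resampled.
train_ids = {r["id"] for r in balanced_train}
test_ids = {r["id"] for r in test}
print(train_ids.isdisjoint(test_ids))  # True
```

If the gap between training and test scores is large, that is the sign of overfitting the question asks about. With cross-validation the same rule applies inside each fold: resample the training folds only.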

Upvotes: 1
