Reputation: 43
The target attribute distribution is currently like this:
mydata.groupBy("Churn").count().show()
+-----+-----+
|Churn|count|
+-----+-----+
| 1| 483|
| 0| 2850|
+-----+-----+
My questions are:
Do oversampling methods like manual duplication, SMOTE, or ADASYN use the available data to create new data points?
If we use such data to train a classification model, will it not be an overfitted one?
Upvotes: 2
Views: 1907
Reputation: 808
My question is: will any oversampling method (manual, SMOTE, ADASYN) use the available data to create new data points?
Yes. SMOTE (Synthetic Minority Over-sampling TEchnique) is a synthetic oversampling method: it creates new minority-class samples from the existing data rather than simply duplicating rows.
The process in SMOTE is roughly:
1. Pick a sample from the minority class.
2. Find its k nearest minority-class neighbours.
3. Randomly choose one of those neighbours and create a synthetic sample at a random point on the line segment between the two.
4. Repeat until the desired class balance is reached.
So, this is a bit smarter than just over-sampling.
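The interpolation idea can be sketched in plain NumPy. This is an illustrative toy, not the imbalanced-learn implementation; the 483/2850 counts are taken from the question above:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbours."""
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]         # indices of k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = nn[i, rng.integers(min(k, n - 1))]  # pick one of its neighbours
        lam = rng.random()                    # random point on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 10 points in 2-D
X_min = rng.normal(size=(10, 2))
X_new = smote_sketch(X_min, n_new=2850 - 483)  # bring 483 up to 2850
print(X_new.shape)  # (2367, 2)
```

In practice you would use a tested library (e.g. imbalanced-learn's `SMOTE`) rather than hand-rolling this, but the core step is exactly the interpolation shown.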
If we use such data to build a classification model, will it not be an overfitted one?
The correct answer would be PROBABLY. Give it a try!
This is why we use test sets and cross-validation: to estimate whether the model will still perform well on unseen data!
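One practical rule that keeps the estimate honest: split first, then oversample only the training portion, so the test set contains only real, unseen rows. A minimal sketch (random duplication stands in for SMOTE; the class counts mirror the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy imbalanced labels mirroring the question: 483 ones, 2850 zeros
y = np.array([1] * 483 + [0] * 2850)
idx = rng.permutation(len(y))

# 1) split FIRST, so the test set never contains oversampled copies
test_idx, train_idx = idx[:666], idx[666:]

# 2) oversample only the training portion (random duplication here)
train_pos = train_idx[y[train_idx] == 1]
train_neg = train_idx[y[train_idx] == 0]
extra = rng.choice(train_pos, size=len(train_neg) - len(train_pos), replace=True)
train_bal = np.concatenate([train_idx, extra])

print(np.bincount(y[train_bal]))             # balanced training labels
print(np.intersect1d(test_idx, extra).size)  # 0: no leakage into the test set
```

If you oversample before splitting, copies of the same minority row can land in both train and test, and the test score will look better than the model really is.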
Upvotes: 1