Reputation: 129
I have two classes in my data. This is what the class distribution looks like:
0.0    169072
1.0     84944
In other words, I have a 2:1 class distribution. I believe I have two choices: downsample class 0.0, or upsample class 1.0. If I go with option 1, I'm losing data; if I go with option 2, I'm using non-real data.
Is there a way I can train the model without upsampling or downsampling?
This is what my classification_report looks like:
              precision    recall  f1-score   support
         0.0       0.68      1.00      0.81     51683
         1.0       1.00      0.00      0.00     24522
    accuracy                           0.68     76205
   macro avg       0.84      0.50      0.40     76205
weighted avg       0.78      0.68      0.55     76205
Upvotes: 0
Views: 517
Reputation: 432
Your data is slightly imbalanced, yes, but that does not mean you only have those two options (under- or over-sampling). You can leave the data as is and apply cost-sensitive training in your model. For example, since your classes have a ratio of 2:1, you would give a weight of 2 to your minority class. In an XGBoost classifier, this argument is called scale_pos_weight. See more in this excellent tutorial.
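As a minimal sketch of that weighting calculation, using the class counts from the question (the XGBoost usage itself is shown only in comments, since it assumes the `xgboost` package is installed):

```python
# Class counts taken from the question's distribution.
n_negative = 169072  # class 0.0 (majority)
n_positive = 84944   # class 1.0 (minority)

# XGBoost's scale_pos_weight is conventionally set to
# (number of negative examples) / (number of positive examples).
scale_pos_weight = n_negative / n_positive
print(round(scale_pos_weight, 2))  # prints 1.99, i.e. roughly the 2:1 ratio

# With xgboost installed, the classifier would then be built like:
# from xgboost import XGBClassifier
# model = XGBClassifier(scale_pos_weight=scale_pos_weight)
```

Computing the weight from the actual counts (rather than hard-coding 2) keeps it correct if the class balance changes.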
Regarding model evaluation, you should use a classification report to get a full picture of your model's true and false predictions (precision and recall are your two best friends in this process!).
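To make the precision/recall point concrete, here is a small pure-Python sketch computing both from hypothetical confusion-matrix counts (the numbers below are illustrative, not from the question):

```python
# Hypothetical confusion-matrix counts for the positive class (1.0).
true_positives = 80
false_positives = 20
false_negatives = 40

# Precision: of everything predicted positive, how much really was positive?
precision = true_positives / (true_positives + false_positives)

# Recall: of everything truly positive, how much did the model catch?
recall = true_positives / (true_positives + false_negatives)

print(precision)  # prints 0.8
print(round(recall, 3))  # prints 0.667
```

A model can score highly on one of these while failing badly on the other, which is exactly what the report in the question shows for class 1.0 (precision 1.00, recall 0.00).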
Upvotes: 2
Reputation: 308763
I would not recommend either approach.
I'm thinking about models to detect fraud. By definition, fraud should be a small percentage of outcomes - on the order of 1-5%. Changing the percentage for training would be a gross distortion of the problem being solved.
Better to leave the proportions as they are.
Make sure that your train, validation, and test data sets all have ratios that reflect the real problem.
Adjust your success metric instead. Don't go for accuracy: a naive model that always predicts the 0 outcome will be correct two-thirds of the time. You want your model to do better than that, or than a weighted coin flip.
I'd recommend using recall as your criterion for success.
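The naive-baseline point can be checked directly against the support counts in the question's report, with a pure-Python sketch:

```python
# Support counts from the question's classification report.
n_class_0 = 51683
n_class_1 = 24522
total = n_class_0 + n_class_1  # 76205

# A model that always predicts 0.0 achieves this accuracy...
naive_accuracy = n_class_0 / total
print(round(naive_accuracy, 2))  # prints 0.68, matching the report's accuracy

# ...while its recall for class 1.0 is zero, which is also exactly
# what the report shows for that class.
```

That the reported model matches the naive baseline's accuracy while having zero recall on class 1.0 is the clearest sign that accuracy is the wrong criterion here.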
Upvotes: 1