Reputation: 2478
I have a multi-class classification problem for which I am trying to use a Random Forest classifier. The target is heavily unbalanced and has the following distribution-
1    34108
4     6748
5     2458
3      132
2       37
7       11
6        6
Now, I am using the class_weight parameter of RandomForestClassifier, and from what I understand, the weights are passed as a dict in the form {class_label: weight}.
So, is the following the correct way:
rfc = RandomForestClassifier(n_estimators=1000, class_weight={1: 0.784, 2: 0.00085, 3: 0.003, 4: 0.155, 5: 0.0566, 6: 0.00013, 7: 0.000252})
Thanks for your help!
Upvotes: 2
Views: 24843
Reputation: 640
You should give more weight to the classes with less data. Say you have 7 possible labels (1, 2, 3, 4, 5, 6, 7) and, for example, you want the model to pay twice as much attention to classes 6 and 7. You could define your dict_weights as:
dict_weights = {1:1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2}
rfc = RandomForestClassifier(n_estimators=1000, class_weight=dict_weights)
You could also define the weights to be inversely proportional to each class's frequency in the training data, but that may lead the model to overestimate the 6s and 7s and make a lot of wrong predictions on the 1s and 4s in your dataset.
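For illustration, the inverse-proportional option could be computed from the counts in the question like this (a sketch; the variable names are my own):

```python
from sklearn.ensemble import RandomForestClassifier

# Class counts taken from the question (class label -> number of samples)
counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}
total = sum(counts.values())  # 43500 observations in all

# Weight each class inversely proportional to its frequency:
# rare classes get large weights, the majority class a small one
dict_weights = {label: total / n for label, n in counts.items()}

rfc = RandomForestClassifier(n_estimators=1000, class_weight=dict_weights)
```

Note how extreme the resulting weights are here (class 6 ends up weighted thousands of times more than class 1), which is exactly why this scheme can make the model over-predict the rarest classes.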
Upvotes: 2
Reputation: 36
Your way is not correct: you need to give less weight to the classes with a large number of samples, not more. You can also use BalancedRandomForestClassifier from the imbalanced-learn package:
from imblearn.ensemble import BalancedRandomForestClassifier
Upvotes: 0
Reputation: 549
If you choose class_weight="balanced", the classes will be weighted inversely proportional to how frequently they appear in the data.
In your example, you are weighting the over-represented classes more heavily than the under-represented classes. I believe this is the opposite of what you want to achieve.
A basic formula to calculate the weight of each class is total observations / (number of classes * observations in class).
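Applying that formula to the counts in the question (a sketch in plain Python; this is the same heuristic class_weight="balanced" uses internally):

```python
# Class counts from the question (class label -> number of samples)
counts = {1: 34108, 2: 37, 3: 132, 4: 6748, 5: 2458, 6: 6, 7: 11}
total = sum(counts.values())  # 43500 observations
n_classes = len(counts)       # 7 classes

# weight = total observations / (number of classes * observations in class)
weights = {label: total / (n_classes * n) for label, n in counts.items()}
# The majority class gets a weight below 1 (class 1 -> ~0.18),
# the rarest class a weight far above 1 (class 6 -> ~1035.7)
```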
Upvotes: 4