Reputation: 433
I have a class imbalance problem and want to solve it using cost-sensitive learning.
Question
Scikit-learn has two options called class weights and sample weights. Is sample weight actually doing option 2) and class weight option 1)? Is option 2) the recommended way of handling class imbalance?
Upvotes: 31
Views: 18576
Reputation: 175
Assume you have 5 samples, of which the first two are of class A and the last three of class B. To "achieve balance" you can then assign weights either to the two classes (e.g. class_weight={'A': 0.6, 'B': 0.4}) or to the five samples (e.g. sample_weight=[0.25, 0.25, 0.167, 0.167, 0.167]). sample_weight thus allows a more fine-grained weighting at the level of individual samples instead of classes.
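For concreteness, here is a minimal sketch of the two options on that toy dataset; LogisticRegression is just an assumed example of an estimator that accepts both arguments:

```python
from sklearn.linear_model import LogisticRegression

# Toy data: two samples of class A, three of class B.
X = [[0.0], [0.1], [1.0], [1.1], [1.2]]
y = ["A", "A", "B", "B", "B"]

# Weight the classes: each class ends up with a total weight of 1.2.
clf_cw = LogisticRegression(class_weight={"A": 0.6, "B": 0.4}).fit(X, y)

# Or weight the samples: each class ends up with a total weight of ~0.5.
clf_sw = LogisticRegression().fit(
    X, y, sample_weight=[0.25, 0.25, 0.167, 0.167, 0.167]
)
```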
If you use both weightings, then the actual weight of a sample will be the product of its sample_weight and the class_weight of its class, and you usually don't want that. Also, some will say that assigning weights to samples in order to balance classes is conceptually awkward (in particular in multilabel classification, where the same sample can belong to both a very frequent and a sparsely populated class, so what sample weight would you assign?) or at least unnecessarily complicated. So you would usually deal with imbalanced classes via class_weight and only use sample_weight "on top" if you additionally want to change the weights of specific individual samples, e.g. in order to "zero out" specific samples.
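A short sketch of that "on top" usage, keeping the class balancing in class_weight and using sample_weight only to zero out one specific sample (same hypothetical toy data and estimator as above):

```python
from sklearn.linear_model import LogisticRegression

X = [[0.0], [0.1], [1.0], [1.1], [1.2]]
y = ["A", "A", "B", "B", "B"]

# class_weight handles the imbalance; sample_weight only zeroes out sample 1.
clf = LogisticRegression(class_weight={"A": 0.6, "B": 0.4})
clf.fit(X, y, sample_weight=[1, 0, 1, 1, 1])
```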
As far as I know, not all functions or metrics in sklearn offer both arguments. If you have a binary classifier and really need to balance classes via sample_weight because the metric doesn't accept class_weight (not sure whether this scenario exists at all, but just in case...), then you can get sample weights for balanced classes using sample_weights = compute_sample_weight(class_weight='balanced', y=y_true), with compute_sample_weight from sklearn.utils.class_weight.
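A small sketch of that fallback, reusing the five-sample example; with class_weight='balanced', each class ends up with the same total weight:

```python
from sklearn.utils.class_weight import compute_sample_weight

y_true = ["A", "A", "B", "B", "B"]
sample_weights = compute_sample_weight(class_weight="balanced", y=y_true)
print(sample_weights)  # [1.25 1.25 0.833... 0.833... 0.833...]
```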
Upvotes: 0
Reputation: 9390
They are similar concepts, but with sample_weights you can force the estimator to pay more attention to some samples, and with class_weights you can force the estimator to learn with more attention to some particular class. sample_weight=0 or class_weight=0 basically means that the estimator doesn't need to take such samples/classes into account in the learning process at all. Thus a classifier (for example) will never predict some class if class_weight=0 for that class. If some sample_weight/class_weight is bigger than the sample_weight/class_weight of other samples/classes, the estimator will try to minimize the error on those samples/classes first. You can use user-defined sample_weights and class_weights simultaneously.
If you want to undersample/oversample your training set by simply cloning/removing samples, this is equivalent to increasing/decreasing the corresponding sample_weights/class_weights.
In more complex cases you can also try to generate samples artificially, with techniques like SMOTE.
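A small sketch of the cloning/weighting equivalence mentioned above, using DecisionTreeClassifier as an assumed example; duplicating a sample and doubling its sample_weight should lead to the same fitted tree on this toy data:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Oversample by cloning the last sample once ...
clf_cloned = DecisionTreeClassifier(random_state=0).fit(X + [[3]], y + [1])

# ... or keep the data as-is and double that sample's weight instead.
clf_weighted = DecisionTreeClassifier(random_state=0).fit(
    X, y, sample_weight=[1, 1, 1, 2]
)

print(clf_cloned.predict([[2.5]]), clf_weighted.predict([[2.5]]))  # same prediction
```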
Upvotes: 19
Reputation: 6756
sample_weight and class_weight have a similar function, namely to make your estimator pay more attention to some samples.
The actual sample weights will be the product of sample_weight and the weights derived from class_weight.
This serves the same purpose as under/oversampling, but the behavior is likely to be different: if you have an algorithm that randomly picks samples (as random forests do), it matters whether you oversampled or not.
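A short sketch of that multiplicative combination (DecisionTreeClassifier is again only an assumed example of an estimator that accepts both arguments):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3], [4]]
y = [0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(class_weight={0: 3.0, 1: 1.0}, random_state=0)
# Sample 0 effectively counts with weight 2.0 * 3.0 = 6.0; the class-1
# samples keep an effective weight of 1.0 each.
clf.fit(X, y, sample_weight=[2.0, 1.0, 1.0, 1.0, 1.0])
```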
To sum it up: class_weight and sample_weight both do 2), and option 2) is one way of handling class imbalance. I don't know of a universally recommended way; I would try 1), 2), and 1) + 2) on your specific problem to see what works best.
Upvotes: 8