WonderWomen

Reputation: 433

What is the difference between sample weight and class weight options in scikit learn?

I have a class imbalance problem and want to solve it using cost-sensitive learning.

  1. Under-sample and over-sample
  2. Give weights to classes to use a modified loss function

Question

Scikit-learn has two options called class weights and sample weights. Is sample weight actually doing option 2) and class weight option 1)? Is option 2) the recommended way of handling class imbalance?

Upvotes: 31

Views: 18576

Answers (3)

schotti

Reputation: 175

Assume you have 5 samples, of which the first two are of class A and the last three of class B. To "achieve balance" you can then assign weights either to the two classes (e.g. class_weight={'A': 0.6, 'B': 0.4}) or to the five samples (e.g. sample_weight=[0.25, 0.25, 0.167, 0.167, 0.167]). sample_weight thus allows a more fine-grained weighting at the level of samples instead of classes.
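
A minimal sketch of what the two options look like in code, assuming a LogisticRegression on the toy data described above (any estimator that accepts both class_weight and sample_weight would do):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [0.2], [1.0], [1.1], [1.2]])  # 5 samples, 1 feature
    y = np.array(['A', 'A', 'B', 'B', 'B'])            # 2x class A, 3x class B

    # Option 1: weight the classes (a dict mapping class label -> weight).
    clf_cw = LogisticRegression(class_weight={'A': 0.6, 'B': 0.4})
    clf_cw.fit(X, y)

    # Option 2: weight the individual samples (one weight per row of X).
    clf_sw = LogisticRegression()
    clf_sw.fit(X, y, sample_weight=[0.25, 0.25, 0.167, 0.167, 0.167])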

If you use both weightings, then the actual weight of a sample will be the product of its sample_weight and the class_weight of its class, and you usually don't want that. Also, some will say that assigning weights to samples in order to balance classes is conceptually awkward (in particular in multilabel classification, where the same sample can belong to both a very frequent and a sparsely populated class, so what sample weight would you assign?) or at least unnecessarily complicated. So you would usually deal with imbalanced classes via class_weight and only use sample_weight "on top" if you additionally want to change the weights of specific individual samples, e.g. in order to "zero out" specific samples.

As far as I know, not all functions or metrics in sklearn offer both arguments. If you have a binary classifier and really need to balance classes via sample_weight because the metric doesn't accept class_weight (not sure if this scenario exists at all, but just in case...), then you can get sample weights for balanced classes using sample_weights = compute_sample_weight(class_weight='balanced', y=y_true), with compute_sample_weight from sklearn.utils.class_weight.
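
For example (a small sketch, reusing the string labels from above), compute_sample_weight turns a class-balancing scheme into per-sample weights:

    import numpy as np
    from sklearn.utils.class_weight import compute_sample_weight

    y_true = np.array(['A', 'A', 'B', 'B', 'B'])
    sample_weights = compute_sample_weight(class_weight='balanced', y=y_true)
    print(sample_weights)
    # [1.25  1.25  0.833 0.833 0.833]  -> n_samples / (n_classes * class_count)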

Upvotes: 0

Ibraim Ganiev

Reputation: 9390

They are similar concepts, but with sample_weight you can force the estimator to pay more attention to some samples, and with class_weight you can force the estimator to learn with attention to some particular class. sample_weight=0 or class_weight=0 basically means that the estimator doesn't need to take such samples/classes into consideration in the learning process at all. Thus a classifier (for example) will never predict some class if class_weight=0 for this class. If some sample_weight/class_weight is bigger than the sample_weight/class_weight of other samples/classes, the estimator will try to minimize the error on those samples/classes first. You can use user-defined sample_weights and class_weights simultaneously.
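
A small sketch (toy data and SVC chosen for illustration, not taken from the answer) showing both arguments used together: class 1 gets twice the weight of class 0, and the last sample is down-weighted to 0 so it is effectively ignored during fitting:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0], [2.5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Per-class weights are set at construction time.
    clf = SVC(class_weight={0: 1.0, 1: 2.0})

    # Per-sample weights are passed at fit time; the final 0.0 removes that sample.
    clf.fit(X, y, sample_weight=[1.0, 1.0, 1.0, 1.0, 1.0, 0.0])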

If you want to undersample/oversample your training set by simply removing/cloning samples, this will be equivalent to decreasing/increasing the corresponding sample_weights/class_weights.
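
As a sketch of that equivalence (toy data assumed), cloning a sample once should give the same model as fitting with sample_weight=2 for that sample:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    # "Oversample" by cloning the last sample once...
    clf_cloned = LogisticRegression().fit(np.vstack([X, X[-1:]]), np.append(y, y[-1]))

    # ...which corresponds to giving that sample a weight of 2.
    clf_weighted = LogisticRegression().fit(X, y, sample_weight=[1, 1, 1, 2])

    print(clf_cloned.coef_, clf_weighted.coef_)  # nearly identical, up to solver tolerance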

In more complex cases you can also try to generate samples artificially, with techniques like SMOTE.
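
SMOTE is not part of scikit-learn itself; one common implementation lives in the third-party imbalanced-learn package. A rough sketch, assuming that package is installed:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Imbalanced toy data: roughly 90% of samples in class 0, 10% in class 1.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print("before:", Counter(y))

    # SMOTE synthesizes new minority-class samples to balance the classes.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))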

Upvotes: 19

ldirer

Reputation: 6756

sample_weight and class_weight have a similar function: to make your estimator pay more attention to some samples.

The actual sample weights will be sample_weight multiplied by the weights derived from class_weight.
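
A tiny numeric sketch (made-up values) of how the two combine: the effective weight of each sample is its sample_weight times the class_weight of its class:

    import numpy as np

    y             = np.array([0, 0, 1, 1, 1])
    class_weight  = {0: 2.0, 1: 1.0}                      # per-class weights
    sample_weight = np.array([1.0, 0.5, 1.0, 1.0, 2.0])   # per-sample weights

    effective = sample_weight * np.array([class_weight[c] for c in y])
    print(effective)  # [2. 1. 1. 1. 2.]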

This serves the same purpose as under/oversampling, but the behavior is likely to be different: if you have an algorithm that randomly picks samples (as random forests do), it matters whether you oversampled or not.

To sum it up:
class_weight and sample_weight both do 2); option 2) is one way to handle class imbalance. I don't know of a universally recommended way; I would try 1), 2), and 1) + 2) on your specific problem to see what works best.

Upvotes: 8
