Reputation: 303
Do we need to set sample_weight when we evaluate our model? I have trained a classification model, but the dataset is imbalanced. When I set sample_weight with compute_sample_weight('balanced'), the scores look very good: precision 0.88 and recall 0.86 for the '1' class. But the scores are bad if I don't set sample_weight: precision 0.85 and recall 0.21. Will sample_weight destroy the original data distribution?
Upvotes: 7
Views: 8374
Reputation: 491
Here is my understanding: sample_weight has nothing to do with balanced or unbalanced classes in itself; it is just a way to reflect the distribution of the sample data. A weight of n on a row means the same thing as repeating that row n times, so the following two expressions are equivalent, and expression 1 is clearly more efficient in terms of space. This sample_weight behaves the same way sample weights do in any other statistical package in any language and has nothing to do with random sampling.
expression 1
X = [[1,1],[2,2]]
y = [0,1]
sample_weight = [1000,2000] # total 3000
versus
expression 2
X = [[1,1],[2,2],[2,2],...,[1,1],[2,2],[2,2]] # total 3000 rows
y = [0,1,1,...,0,1,1]
sample_weight = [1,1,1,...,1,1,1] # or just set as None
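To make the equivalence concrete, here is a minimal sketch (assuming scikit-learn, with LogisticRegression as a hypothetical stand-in for whatever estimator you use) showing that fitting on the two weighted rows and fitting on the 3000 physically repeated rows yield the same model, up to optimizer tolerance:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 1], [2, 2]])
y = np.array([0, 1])
sample_weight = np.array([1000, 2000])

# Expression 1: compact data plus weights
clf_weighted = LogisticRegression(tol=1e-8, max_iter=10000).fit(
    X, y, sample_weight=sample_weight)

# Expression 2: the same rows physically repeated 1000 and 2000 times, no weights
X_rep = np.repeat(X, sample_weight, axis=0)   # 3000 rows
y_rep = np.repeat(y, sample_weight)
clf_repeated = LogisticRegression(tol=1e-8, max_iter=10000).fit(X_rep, y_rep)

# The fitted coefficients should agree up to optimizer tolerance
print(clf_weighted.coef_, clf_repeated.coef_)
print(np.allclose(clf_weighted.coef_, clf_repeated.coef_, atol=1e-4))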
Upvotes: 0
Reputation: 1084
The sample_weight parameter is only used during training.
Suppose you have a dataset with 16 points belonging to class "0" and 4 points belonging to class "1".
Without this parameter, every point has a weight of 1 in the loss calculation: all points contribute equally to the loss that the model is minimizing. That means 80% of the loss is due to points of class "0" and 20% is due to points of class "1".
By setting it to "balanced", scikit-learn will automatically calculate weights to assign to class "0" and class "1" such that 50% of the loss comes from class "0" and 50% from class "1".
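As a quick check of that arithmetic (a minimal sketch, assuming scikit-learn), compute_sample_weight('balanced') on this 16-vs-4 example gives each class the same total weight, so each class drives half of the loss:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0] * 16 + [1] * 4)
w = compute_sample_weight('balanced', y)

print(w[y == 0][0], w[y == 1][0])        # 0.625 per class "0" point, 2.5 per class "1" point
print(w[y == 0].sum(), w[y == 1].sum())  # 10.0 and 10.0: equal total contributions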
This parameter affects the "optimal threshold" you need to use to separate class "0" predictions from class "1" predictions, and it also influences the performance of your model.
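Putting it together, here is a minimal end-to-end sketch (assuming scikit-learn; the synthetic dataset and LogisticRegression are hypothetical stand-ins for the asker's actual data and model): weight the training loss with 'balanced' sample weights, then evaluate without weights so precision and recall reflect the real, imbalanced test distribution.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced problem: roughly 90% class "0", 10% class "1"
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# sample_weight only enters the training objective
w_tr = compute_sample_weight('balanced', y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr, sample_weight=w_tr)

# No sample_weight here: precision/recall are measured on the untouched,
# imbalanced test distribution
print(classification_report(y_te, clf.predict(X_te)))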
Upvotes: 0