Reputation: 5182
I am doing some classification and I was looking at the f1 score and noticed something strange.
When I do:
"f1:" + str(f1_score(y_test_bin, target, average="weighted"))
I get :
f1:0.444444444444
When I do:
print "f1:" + str(f1_score(y_test_bin, target,pos_label=0, average="weighted"))
I get:
f1:0.823529411765
Which is strange since I set the average to be 'weighted'. This should give me the weighted average of those two scores, which is independent of the positive label.
I can also see this in the classification report:
             precision    recall  f1-score   support

          0       0.76      0.90      0.82        39
          1       0.60      0.35      0.44        17

avg / total       0.71      0.73      0.71        56
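Checking by hand, the avg / total row really is the support-weighted average of the two per-class scores: (39 × 0.823529 + 17 × 0.444444) / 56 ≈ 0.71.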
In the classification report I get the weighted average, but not when I use the f1 score function. Why is this ?
Upvotes: 1
Views: 2837
Reputation: 1
I was struggling with this problem as well and found a solution after reading eickenberg's answer on this thread, which is definitely worth a read for the background on this.
In short, when sklearn interprets the data as binary it overrides the averaging and reports the score for the positive class only. It does this with the default positive label or when you explicitly specify a pos_label. The solution then is to set pos_label to None:
print "f1:" + str(f1_score(y_test_bin, target, pos_label=None, average="weighted"))
Hope this helps!
Upvotes: 0
Reputation: 14377
The docstring of f1_score contains a paragraph about this behaviour, although somewhat indirectly:
average : string, [None, 'micro', 'macro', 'samples', 'weighted' (default)]
    If ``None``, the scores for each class are returned. Otherwise,
    unless ``pos_label`` is given in binary classification, this
    determines the type of averaging performed on the data:
    [...]
    ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.
It says [...] Otherwise, unless pos_label is given in binary classification, [...], so in binary classification the averaging is overridden and the function just gives back the F1 score of the class indicated by pos_label (which is 1 by default), i.e. the class it treats as the detections.
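To see concretely what the overridden call is reporting, it helps to look at the per-class scores; a minimal sketch with illustrative data (not the question's arrays):
from sklearn.metrics import f1_score

# illustrative binary data
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# index 0 holds the F1 for label 0, index 1 the F1 for label 1
per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1])
print("f1 per class: " + str(per_class))
# the overridden "weighted" call described above returns per_class[1]
# (pos_label defaults to 1); with pos_label=0 it returns per_class[0]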
As mentioned in a comment, this special treatment of binary classification has been discussed in a github issue. The reason it works this way is due more to legacy than anything else: changing this behaviour could be disruptive to many codebases.
Upvotes: 2