Olivier_s_j

Reputation: 5182

Strange F1 score result using scikit-learn

I am doing some classification and I was looking at the f1 score and noticed something strange.

When I do:

"f1:" + str(f1_score(y_test_bin, target, average="weighted"))

I get :

f1:0.444444444444

When I do:

print "f1:" + str(f1_score(y_test_bin, target,pos_label=0, average="weighted"))

I get:

f1:0.823529411765

Which is strange, since I set the average to "weighted". That should give me the weighted average of the two per-class scores, which is independent of which label is treated as the positive label.

I can also see this in the classification report:

         precision    recall  f1-score   support

      0       0.76      0.90      0.82        39
      1       0.60      0.35      0.44        17

avg / total       0.71      0.73      0.71        56

In the classification report I do get the weighted average, but not when I use the f1_score function. Why is this?
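For reference, the 0.71 in the avg / total row is just the support-weighted average of the two per-class scores, so that is the number I expected f1_score to return. A quick check with the values above:

# support-weighted average of the two per-class F1 scores from the report above
f1_class0, n0 = 0.823529411765, 39
f1_class1, n1 = 0.444444444444, 17

print (f1_class0 * n0 + f1_class1 * n1) / (n0 + n1)   # ~0.708, i.e. the 0.71 in avg / total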

Upvotes: 1

Views: 2837

Answers (2)

verhoevenben

Reputation: 1

I was struggling with this problem as well and found a solution after reading eickenberg's answer in this thread, which is definitely worth a read for the background.

In short, when sklearn interprets the data as binary it overrides the averaging and returns the score of the positive class. It does this both by default and when you explicitly pass a pos_label. The fix is to set pos_label to None.

print "f1:" + str(f1_score(y_test_bin, target, pos_label=None, average="weighted"))

Hope this helps!

Upvotes: 0

eickenberg

Reputation: 14377

The docstring of f1_score contains a paragraph about this behaviour, although somewhat indirectly:

average : string, [None, 'micro', 'macro', 'samples', 'weighted' (default)]
    If ``None``, the scores for each class are returned. Otherwise,
    unless ``pos_label`` is given in binary classification, this
    determines the type of averaging performed on the data:

[...]

     ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.

It says "[...] Otherwise, unless pos_label is given in binary classification, [...]", so in binary classification the averaging is overridden and the function just gives back the f1_score of the class indicated by pos_label (which is 1 by default), i.e. treating that class as the detections.
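Concretely, with some made-up binary labels (not the asker's data), you can compare the per-class scores returned by average=None with the supposedly averaged call:

from sklearn.metrics import f1_score

# made-up binary labels, purely for illustration
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

per_class = f1_score(y_true, y_pred, average=None)   # [f1 for label 0, f1 for label 1]
print "per-class f1:", per_class

# On the scikit-learn versions this thread refers to, the next call collapses to
# per_class[1] (the score for pos_label=1) because of the binary override; later
# releases changed this so that 'weighted' really averages both classes.
print "average='weighted':", f1_score(y_true, y_pred, average="weighted")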

As mentioned in a comment, this special treatment of binary classification has been discussed in a GitHub issue. The reason it works this way is mostly legacy: changing this behaviour could be disruptive to many codebases.

Upvotes: 2
