Olivier_s_j

Reputation: 5182

Strange F1 score result using scikit-learn

I am doing some classification and I was looking at the f1 score and noticed something strange.

When I do:

"f1:" + str(f1_score(y_test_bin, target, average="weighted"))

I get :

f1:0.444444444444

When I do:

print "f1:" + str(f1_score(y_test_bin, target,pos_label=0, average="weighted"))

I get:

f1:0.823529411765

Which is strange, since I set the average to "weighted". That should give me the weighted average of the two per-class scores, which is independent of which label is treated as the positive label.

I can also see this in the classification report:

         precision    recall  f1-score   support

      0       0.76      0.90      0.82        39
      1       0.60      0.35      0.44        17

avg / total       0.71      0.73      0.71        56

In the classification report I do get the weighted average, but not when I use the f1_score function. Why is this?
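For reference, the 0.71 in the avg / total row is just the support-weighted average of the two per-class scores, so that is the number I expected f1_score to return. A quick check with the values above:

# support-weighted average of the two per-class F1 scores from the report above
f1_class0, n0 = 0.823529411765, 39
f1_class1, n1 = 0.444444444444, 17

print (f1_class0 * n0 + f1_class1 * n1) / (n0 + n1)   # ~0.708, i.e. the 0.71 in avg / total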

Upvotes: 1

Views: 2837

Answers (2)

verhoevenben

Reputation: 1

I was struggling with this problem as well and found a solution after reading eickenberg's answer in this thread, which is definitely worth a read for the background.

In short, when sklearn interprets the data as binary it overrides the averaging and returns the score of the positive class. It does this both by default and when you explicitly pass a pos_label. The fix is to set pos_label to None.

print "f1:" + str(f1_score(y_test_bin, target, pos_label=None, average="weighted"))

Hope this helps!

Upvotes: 0

eickenberg

Reputation: 14377

The docstring of f1_score contains a paragraph about this behaviour, although somewhat indirectly:

average : string, [None, 'micro', 'macro', 'samples', 'weighted' (default)]
    If ``None``, the scores for each class are returned. Otherwise,
    unless ``pos_label`` is given in binary classification, this
    determines the type of averaging performed on the data:

[...]

     ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.

It says "[...] Otherwise, unless pos_label is given in binary classification, [...]", so in binary classification the averaging is overridden and the function just gives back the f1_score of the class indicated by pos_label (which is 1 by default), i.e. treating that class as the detections.
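Concretely, with some made-up binary labels (not the asker's data), you can compare the per-class scores returned by average=None with the supposedly averaged call:

from sklearn.metrics import f1_score

# made-up binary labels, purely for illustration
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

per_class = f1_score(y_true, y_pred, average=None)   # [f1 for label 0, f1 for label 1]
print "per-class f1:", per_class

# On the scikit-learn versions this thread refers to, the next call collapses to
# per_class[1] (the score for pos_label=1) because of the binary override; later
# releases changed this so that 'weighted' really averages both classes.
print "average='weighted':", f1_score(y_true, y_pred, average="weighted")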

As mentioned in a comment, this special treatment of binary classification has been discussed in a GitHub issue. The reason it works this way is mostly legacy: changing this behaviour could be disruptive to many codebases.

Upvotes: 2
