Vincent

Reputation: 8846

Is choosing a model based on F1 score (computed at threshold = 0.5) equivalent to choosing a model based on the area under the precision-recall curve?

https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc provides a good summary of Accuracy vs AUROC vs F1 vs AUPR.

When comparing the performance of different models on the same dataset, one might choose Accuracy, AUROC, AUPR, or F1, depending on the use case.

One thing I'm not quite clear on, though, is: does choosing based on F1 (the harmonic mean of precision and recall) at a threshold of 0.5 result in the same model choice as choosing based on the area under the PR curve?
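For concreteness, here is a minimal sketch (scikit-learn, with made-up labels and scores) of the two selection criteria I have in mind:

    import numpy as np
    from sklearn.metrics import average_precision_score, f1_score

    # Made-up ground truth and one model's predicted probabilities.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.55, 0.2, 0.9])

    # Criterion 1: F1 after hard-thresholding the scores at 0.5.
    f1_at_half = f1_score(y_true, (y_score >= 0.5).astype(int))

    # Criterion 2: area under the precision-recall curve (average
    # precision), computed from the raw scores with no fixed threshold.
    pr_auc = average_precision_score(y_true, y_score)

    print(f1_at_half, pr_auc)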

If so, why?

Upvotes: 0

Views: 605

Answers (1)

desertnaut

Reputation: 60370

It is most certainly not, for a very simple and fundamental reason: AUC scores (of either the ROC or the PR curve) give the performance of the model averaged over a whole range of thresholds. Looking closely at the linked document, you'll notice the following regarding the PR AUC (emphasis in the original):

You can also think of PR AUC as the average of precision scores calculated for each recall threshold. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.

and you may use PR AUC

when you want to choose the threshold that fits the business problem
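That "average over thresholds" reading can be made concrete; here is a minimal sketch (scikit-learn, reusing the toy labels and scores from the question) showing that the PR AUC returned by average_precision_score matches precision averaged with recall-step weights over all thresholds:

    import numpy as np
    from sklearn.metrics import average_precision_score, precision_recall_curve

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.55, 0.2, 0.9])

    precision, recall, _ = precision_recall_curve(y_true, y_score)

    # Precision averaged over recall steps, i.e. over the whole range
    # of thresholds at once; no single operating point is ever chosen.
    ap_by_hand = -np.sum(np.diff(recall) * precision[:-1])

    print(ap_by_hand, average_precision_score(y_true, y_score))  # equal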

The moment you choose any specific threshold (for precision, recall, F1, etc.), you have left the realm of AUC scores (ROC or PR) altogether - you are at a single point on the curve, and the average area under the curve is no longer useful (or even meaningful).
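To see the non-equivalence explicitly, here is a minimal sketch (toy data constructed for the purpose, not taken from the linked article) in which the two criteria pick different models:

    import numpy as np
    from sklearn.metrics import average_precision_score, f1_score

    y_true = np.array([1, 1, 0, 0])

    # Model A clears the 0.5 threshold, but ties a negative with the
    # positives at the top of the ranking.
    scores_a = np.array([0.6, 0.6, 0.6, 0.4])
    # Model B ranks the classes perfectly, but every score sits below 0.5.
    scores_b = np.array([0.45, 0.4, 0.3, 0.2])

    for name, scores in [("A", scores_a), ("B", scores_b)]:
        f1 = f1_score(y_true, (scores >= 0.5).astype(int), zero_division=0)
        ap = average_precision_score(y_true, scores)
        print(f"model {name}: F1@0.5 = {f1:.3f}, PR AUC = {ap:.3f}")

    # model A: F1@0.5 = 0.800, PR AUC = 0.667
    # model B: F1@0.5 = 0.000, PR AUC = 1.000
    # Selecting on F1 at threshold 0.5 picks A; selecting on PR AUC picks B.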

I have argued elsewhere that AUC scores can be misleading, in the sense that most people think they measure something other than what they actually do, i.e. the performance of the model over a whole range of thresholds, while the model one is actually going to deploy (and whose performance one is therefore interested in) will necessarily involve a specific threshold.

Upvotes: 2
