Reputation: 9
I have class X with 1000 observations and class Y with 2000 observations. I am trying to decide which classification evaluation metric is most appropriate here and why:

1. Precision recall curve
2. AUC ROC
3. Accuracy
4. Confusion matrix and classification report
I am tempted to stick with option 4, since in my opinion this is not an imbalanced set and we do not need a precision recall curve. Please elaborate on what is appropriate here.
Upvotes: 0
Views: 1316
Reputation: 445
My very concise answer would be: all of them are useful, because each gives you a different insight into your classifier. Now let me elaborate a bit more.
First, about the classification metrics you mention. The precision recall curve and the ROC curve are closely related but not the same thing: the ROC curve plots the true positive rate against the false positive rate, while the precision recall curve plots precision against recall, and the AUC ROC is simply the area under the ROC curve. Your first two points are therefore strongly related, and you actually need both the curve and its area to get qualitative and quantitative insight into performance. What is particularly interesting about these curves is that they give you a (precision, recall) or (FPR, TPR) pair for each and every classification threshold, which is a very compact way to store meaningful information.
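As a minimal sketch (the `y_true` / `y_score` arrays below are toy placeholders, to be replaced with your own labels and predicted probabilities), both curves and the AUC can be computed with scikit-learn:

```python
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

# toy data: true labels and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)                       # ROC curve points
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)  # PR curve points
print("AUC ROC:", roc_auc_score(y_true, y_score))                           # area under the ROC curve
```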
Then, you mention accuracy, which is the most basic classification metric. It is of course interesting to display it, but keep in mind that accuracy can be misleading: imagine a binary classification problem where one class occurs in only 10% of the cases. Then a model that always predicts the other class is 90% accurate! That is why accuracy is most useful when compared to other metrics, typically precision, recall and F1 score. A quick overview of these three:

- Precision: TP / (TP + FP), the fraction of predicted positives that are actually positive.
- Recall: TP / (TP + FN), the fraction of actual positives that the model detects.
- F1 score: the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall).
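These, together with accuracy, are one-liners in scikit-learn (again with toy `y_true` / `y_pred` placeholders, this time using hard class predictions rather than probabilities):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# toy data: true labels and hard class predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```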
Finally, let's discuss the last point you mentioned: the confusion matrix and the classification report. Given what I wrote above, elaborating on the classification report would be redundant, because it contains exactly the previous metrics (precision, recall, F1 and support, per class). The only thing I would add is that you should probably display, or even log, a classification report each time you run your model.
As for the confusion matrix, it simply gives you the counts of true/false positives and negatives, which are the raw material from which the above metrics are computed. It is worth looking at to detect anomalies, but the raw counts alone are not enough, since the more informative quantities are the metrics derived from them.
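Both are again available in scikit-learn (reusing the toy `y_true` / `y_pred` placeholders from above):

```python
from sklearn.metrics import confusion_matrix, classification_report

# toy data: true labels and hard class predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

print(confusion_matrix(y_true, y_pred))       # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1 and support
```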
The Matthews correlation coefficient (MCC) measures the correlation between the true and predicted labels. It is particularly relevant when you have class imbalance, which does not seem to be your case, but I still mention it as a final word to end this (probably too long) elaboration.
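If you want to try it, it follows the same pattern (toy placeholders again):

```python
from sklearn.metrics import matthews_corrcoef

# toy data: true labels and hard class predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

print("MCC:", matthews_corrcoef(y_true, y_pred))
```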
Upvotes: 1