Reputation: 525
I am confused about the relationship between recall and the area under the precision-recall curve. I am using a binary classifier on an imbalanced dataset, and I computed both the recall and the area under the precision-recall curve with the default settings of the sklearn Python package. In my case, class 1 is the rare class and class 0 is the other class.
I trained two models. For the second model, I used oversampling to balance the training dataset. The test dataset was left untouched for both models. Here are the values recorded on the test dataset.
Training with the imbalanced dataset:
recall: 0.629, AUC of precision-recall curve: 0.8828
Training with the balanced dataset:
recall: 0.8426, AUC of precision-recall curve: 0.884
My questions are:
Why did the recall improve while the area under the precision-recall curve barely changed?
Can I say that, if I choose an appropriate threshold, the first model is as good as the second one? Which evaluation metric should I focus on?
Can I safely say that this model is tolerant to the imbalanced dataset, since the area under the precision-recall curve does not change much?
Upvotes: 0
Views: 195
Reputation: 692
Why did the recall improve while the area under the precision-recall curve barely changed?
Try plotting the precision-recall curves of both models on the same graph; that should make it clear why this happens. The recall you report is measured at a single decision threshold, whereas the area under the curve summarizes every threshold, so the first can move while the second stays put. For clarification, which class (1 or 0) is treated as positive when you count TP? If it is 1, then the result makes sense: class balancing generally increases FP as well, and we balance precisely to raise recall for the rare class in the distribution.
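To illustrate the point about the two curves, here is a minimal sketch in plain Python. The labels and scores are made up, and the assumption that oversampling mostly pushes scores upward without changing their ranking is only an idealization; your models may behave differently. The helper names `recall_at` and `pr_points` are hypothetical. In practice, sklearn's `precision_recall_curve` gives you the same kind of points.

```python
def recall_at(y_true, scores, threshold=0.5):
    """Recall of class 1 when predicting 1 for every score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    positives = sum(y_true)
    return tp / positives


def pr_points(y_true, scores):
    """(recall, precision) pairs from sweeping the threshold over each distinct score."""
    positives = sum(y_true)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        predicted = sum(1 for s in scores if s >= t)
        points.append((tp / positives, tp / predicted))
    return points


# Made-up test labels (class 1 is the rare class) and made-up model scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores_imbalanced = [0.90, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10, 0.08, 0.05, 0.02]
# Assumption: the oversampled model ranks samples identically but scores them higher.
scores_balanced = [0.97, 0.90, 0.75, 0.70, 0.65, 0.45, 0.30, 0.25, 0.15, 0.10]

recall_a = recall_at(y_true, scores_imbalanced)  # recall at the default 0.5 cutoff
recall_b = recall_at(y_true, scores_balanced)
same_curve = pr_points(y_true, scores_imbalanced) == pr_points(y_true, scores_balanced)

print(recall_a, recall_b, same_curve)
```

In this toy case the recall at 0.5 differs between the two score sets, yet the precision-recall points are identical because only the ranking of the samples matters for the curve.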
Can I say that, if I choose an appropriate threshold, the first model is as good as the second one?
This is one of the reasons we do confusion matrix analysis. In binary classification especially, the reported metrics are calculated from the hard labels [1 and 0], not from the actual class separation value (the score threshold). So you can check the recall and the precision-recall AUC against different class separation values.
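As a rough sketch of checking recall against the class separation value, the plain-Python snippet below searches for the highest threshold that reaches a target recall and reports the precision paid for it. The data are made up and the helper names `precision_recall_at` and `threshold_for_recall` are hypothetical, not part of sklearn.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of class 1 from the confusion-matrix counts at one threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_for_recall(y_true, scores, target_recall):
    """Highest threshold whose recall reaches target_recall, or None if none does."""
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(y_true, scores, t)
        if recall >= target_recall:
            return t, precision, recall
    return None


# Made-up labels and scores standing in for the model trained on imbalanced data.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10, 0.08, 0.05, 0.02]

default = precision_recall_at(y_true, scores, 0.5)  # metrics at the usual 0.5 cutoff
tuned = threshold_for_recall(y_true, scores, 0.75)  # (threshold, precision, recall)

print(default, tuned)
```

If a lowered threshold on the first model reaches the second model's recall at similar precision, the two models are probably comparable. The threshold is best picked on a validation split rather than on the test set, so that the test metrics stay unbiased.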
Which evaluation metric should I focus on?
It depends on your application domain; choose the model that best meets the sensitivity and specificity your use case requires.
Upvotes: 0