Reputation: 493
This question was already asked a year ago on StackExchange/Stats, but it was tagged as off-topic and closed without answer.
As a result my question is the same: is there a Python (scikit-learn or other) implementation of Cost Curves, as described in Cost curves: An improved method for visualizing classifier performance? If not, how could I implement it, considering ground truth labels, predictions, and optionally misclassification costs?
This method plots the performance (normalized expected cost) over operating points (a probability cost function based on the probability of correctly classifying a positive sample).
In the case where misclassification costs of positive and negative samples are all equal to 1, the performance corresponds the error rate, whereas an operating point is the probability of an example being from the positive class.
Upvotes: 2
Views: 2614
Reputation: 493
I worked on it, and I think I've got a working implementation.
import numpy as np
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# %% INPUTS
# C(-|+)
cost_fn = <a scalar value>
# C(+|-)
cost_fp = <a scalar value>
# Ground truth
truth = <a list of 0 (negative class) or 1 (positive class)>
# Predictions from a classifier
score = <a list of [0,1] class probabilities>
# %% OUTPUTS
# 1D-array of x-axis values (normalized PC)
pc = None
# list of lines as (slope, intercept)
lines = []
# lower envelope of the list of lines as a 1D-array of y-axis values (NEC)
lower_envelope = []
# area under the lower envelope (the smaller, the better)
area = None
# %% COMPUTATION
# points from the roc curve, because a point in the ROC space <=> a line in the cost space
roc_fpr, roc_tpr, _ = roc_curve(truth, score)
# compute the normalized p(+)*C(-|+)
thresholds = np.arange(0, 1.01, .01)
pc = (thresholds*cost_fn) / (thresholds*cost_fn + (1-thresholds)*cost_fp)
# compute a line in the cost space for each point in the roc space
for fpr, tpr in zip(roc_fpr, roc_tpr):
slope = (1-tpr-fpr)
intercept = fpr
lines.append((slope, intercept))
# compute the lower envelope
for x_value in pc:
y_value = min([slope*x_value+intercept for slope, intercept in lines])
lower_envelope.append(max(0, y_value))
lower_envelope = np.array(lower_envelope)
# compute the area under the lower envelope using the composite trapezoidal rule
area = np.trapz(lower_envelope, pc)
# %% EXAMPLE OF PLOT
# display each line as a thin dashed line
for slope, intercept in lines:
plt.plot(pc, slope*pc+intercept, color="grey", lw=1, linestyle="--")
# display the lower envelope as a thicker black line
plt.plot(pc, lower_envelope, color="black", lw=3, label="area={:.3f}".format(area))
# plot parameters
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05*max(lower_envelope)])
plt.xlabel("Probability Cost Function")
plt.ylabel("Normalized Expected Cost")
plt.title("Cost curve")
plt.legend(loc="lower right")
plt.show()
Example of result using cost_fn=cost_fp=1
, breast cancer dataset and the scores of a Gaussian Naive Bayes classifier:
Upvotes: 2