A. Gehani

Reputation: 93

Plotting ROC curve for different threshold values python

I am using MLP for audio classification. The following code is used to plot the ROC curve and obtain the optimal threshold values:

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
optimal_idx = dict()
optimal_threshold = dict()
for i in range(num_labels):
    fpr[i], tpr[i], thres = roc_curve(Y_test[:, i], Y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    optimal_idx[i] = np.argmax(tpr[i] - fpr[i])
    optimal_threshold[i] = thres[optimal_idx[i]]
    print(f'Threshold value for class{i}:', optimal_threshold[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], thres = roc_curve(Y_test.ravel(), Y_pred.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_labels)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(num_labels):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= num_labels
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
lw = 2
plt.figure()
plt.plot(
    fpr["micro"],
    tpr["micro"],
    label="micro-average ROC curve (area = {0:0.2f})".format(roc_auc["micro"]),
    color="deeppink",
    linestyle=":",
    linewidth=4,
)
plt.plot(
    fpr["macro"],
    tpr["macro"],
    label="macro-average ROC curve (area = {0:0.2f})".format(roc_auc["macro"]),
    color="navy",
    linestyle=":",
    linewidth=4,
)

colors = cycle(["aqua", "darkorange", "cornflowerblue"])
for i, color in zip(range(num_labels), colors):
    plt.plot(
        fpr[i],
        tpr[i],
        color=color,
        lw=lw,
        label="ROC curve of class {0} (area = {1:0.2f})".format(i, roc_auc[i]),
    )

plt.plot([0, 1], [0, 1], "k--", lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC for MLP")
plt.legend(loc="lower right")
plt.show()

It works fine and provides me with the optimal threshold values for all the classes in my dataset, but I am unable to plot the ROC for a range of user-defined threshold values. Is there a way to plot the ROC for a different set of thresholds?

Upvotes: 0

Views: 803

Answers (1)

mhenning

Reputation: 1833

You can use

from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y, scores, drop_intermediate=False)

With the drop_intermediate=False parameter, you explicitly get every threshold value. The number of thresholds is len(np.unique(scores)) + 1: one threshold is np.inf, and the rest are the unique score values. You can then choose the threshold range you want by slicing the returned arrays at the appropriate threshold values, for example as sketched below.
Because the thresholds are the unique score values, a threshold placed between two score values behaves exactly like the threshold at the larger of the two score values.
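A minimal sketch of such slicing (the names y, scores, t_low and t_high are placeholders, not from the question):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y, scores, drop_intermediate=False)

# user-defined threshold range (placeholder values)
t_low, t_high = 0.2, 0.8
mask = (thresholds >= t_low) & (thresholds <= t_high)

# plot only the part of the ROC curve whose thresholds fall in that range
plt.plot(fpr[mask], tpr[mask], marker="o")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC restricted to thresholds in [{t_low}, {t_high}]")
plt.show()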

Edit: taking the example from the sklearn page for roc_curve, and expanding it a bit to force more thresholds:

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
scores = np.array([0.1, 0.2, 0.1, 0.45, 0.35, 0.4, 0.8, 0.8, 1.0])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
print(thresholds.shape)  # prints (5,)
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2, drop_intermediate=False)
print(thresholds.shape)  # prints (8,): 7 unique scores (0.1 and 0.8 each appear twice) plus np.inf

So we have 9 labeled samples spread over 3 classes, and for each sample the model gave a probability estimate for the target class (you have to specify which class counts as the positive class).
The docs state that "scores" should be "probability estimates of the positive class, confidence values, or non-thresholded measure of decisions", so they are not predicted class labels but probabilities for the class estimate. So for every sample in y, the model gives you a probability that this sample belongs to class 2 (in this example).
Also, if you have more than 2 classes, or if your classes are not labeled {-1, 1} or {0, 1}, you have to pass the positive label as the pos_label argument.
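As a rough sketch of where such scores come from, assuming a scikit-learn style classifier with predict_proba (clf, X_test and y_test are hypothetical names, not from the question):

from sklearn.metrics import roc_curve

proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes)
target_class = 2                    # the class treated as "positive"
scores = proba[:, target_class]     # probability of the positive class per sample

fpr, tpr, thresholds = roc_curve(
    y_test, scores, pos_label=target_class, drop_intermediate=False
)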

The values fpr (false positive rate), tpr (true positive rate), and thresholds are:

fpr = [0., 0., 0., 0.16666667, 0.33333333, 0.5, 0.66666667, 1.]
tpr = [0., 0.33333333, 1., 1., 1., 1., 1., 1.]
thresholds = [inf, 1., 0.8, 0.45, 0.4, 0.35, 0.2, 0.1]

So, what do all these numbers tell you? The three arrays are aligned element by element: thresholds[i] produces the operating point (fpr[i], tpr[i]). The thresholds are sorted in descending order, so fpr and tpr are non-decreasing along the arrays. For example, the high threshold 0.8 (third entry) already gives a tpr of 1.0 (great, you found all class-2 samples) with an fpr of 0.0 (none of the other samples is labeled as 2). Lowering the threshold to 0.35 (sixth entry) keeps the tpr at 1.0 but raises the fpr to 0.5, because half of the non-2 samples now score above the threshold.
The threshold values come from the score values, because those are the fixed probabilities the model assigned to the samples. We have no prediction with, e.g., probability 0.3, so 0.3 is not a threshold. If you set 0.3 as an arbitrary threshold, it would behave exactly like the threshold 0.35, as shown in the sketch below.
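One possible sketch (not part of the original answer) for reading off the ROC point of an arbitrary user-chosen threshold is to map it to the effective entry of the descending thresholds array:

import numpy as np

def roc_point_at(user_threshold, fpr, tpr, thresholds):
    """Return (fpr, tpr) for the decision rule score >= user_threshold."""
    # thresholds is sorted descending; the effective entry is the last
    # threshold that is still >= user_threshold
    idx = np.searchsorted(-thresholds, -user_threshold, side="right") - 1
    return fpr[idx], tpr[idx]

# with the arrays from the example above, 0.3 behaves like the threshold 0.35:
# roc_point_at(0.3, fpr, tpr, thresholds) -> (0.5, 1.0)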

The threshold values from your comment tell you the probability thresholds for the respective false positive rate and true positive rate that roc_curve returns.

Upvotes: 0
