Reputation: 339
I am trying to find the optimal threshold T of X to predict Y. I would normally use Youden's J in this setting; however, when the threshold is a lower bound (i.e., when Y varies inversely with X), the classic implementation does not seem to hold.
The following post has some partial answers (the first answer produces better results), but according to the comments the method is not reliable, and no paper is cited: "Roc curve and cut off point. Python"
def cutoff_youdens_j(fpr, tpr, thresholds):
    j_scores = tpr - fpr  # J = sensitivity (= tpr) + specificity (= 1 - fpr) - 1
    j_ordered = sorted(zip(j_scores, thresholds))
    return j_ordered[-1][1]
import numpy as np
from sklearn.metrics import roc_curve
X = np.arange(1, 10)
# Y is an example of a binary dependent variable that varies inversely to the predictor X
Y = X < 5
fpr, tpr, thresholds = roc_curve(Y, X)
T = cutoff_youdens_j(fpr, tpr, thresholds)
print(T)
# OUTPUT: 10
Expected output would be 5, however I get 10.
Are there any better methods for optimal threshold selection and is there a paper demonstrating this?
It would also be useful to know whether the resulting threshold is actually a lower or an upper bound.
EDIT: One possibility would be to negate X beforehand and then negate T.
X = np.arange(1, 10)
Y = X < 5
X = -X
fpr, tpr, thresholds = roc_curve(Y, X)
T = cutoff_youdens_j(fpr, tpr, thresholds)
T = -T
print(T)
# OUTPUT: 4
This works, but the direction of the association has to be determined beforehand. Are there any other methods that work with both positive and negative associations between X and Y?
Upvotes: 0
Views: 442
Reputation: 7969
Your problem is that the positive class has lower X values. scikit-learn assumes that higher scores indicate the positive class; otherwise the ROC curve is inverted, here with an AUC of 0.0:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(Y, X))
# OUTPUT: 0.0
ROC analysis comes from the field of signal detection, and it critically depends on the definition of a positive signal, i.e., the direction of the comparison. Some libraries detect that direction automatically and some don't, but in the end it always has to be determined.
And so the rest is correct: the "best" threshold in this case is one of the corners of the curve.
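To see why, it may help to print J at every candidate threshold for the data from the question. This is only a minimal sketch; it passes `drop_intermediate=False` so that `roc_curve` keeps all thresholds instead of pruning the collinear ones:

```python
import numpy as np
from sklearn.metrics import roc_curve

X = np.arange(1, 10)
Y = X < 5  # positive class has the LOWER X values, as in the question

# drop_intermediate=False keeps every candidate threshold on the curve
fpr, tpr, thresholds = roc_curve(Y, X, drop_intermediate=False)
j_scores = tpr - fpr

for t, j in zip(thresholds, j_scores):
    print(f"threshold={t}: J={j:+.2f}")

# J is never positive here; its maximum (0) is reached only at the two
# trivial end points: "predict nothing as positive" and
# "predict everything as positive".
```

With the curve inverted, every informative threshold scores worse than the trivial end points, so sorting by J can only return one of those end points.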
Just make sure your positive class is set properly, and you're good to go:
Y = X > 5
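If you would rather not fix the direction by hand, one possible approach (a sketch, not something built into scikit-learn) is to maximize |J| instead of J. For the rule "x >= t is positive", J = tpr - fpr; the complementary rule "x < t is positive" has tpr' = 1 - tpr and fpr' = 1 - fpr, so its J is exactly -J. The sign of J at the best threshold then tells you which direction applies. The helper name `cutoff_youdens_j_any_direction` is hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

def cutoff_youdens_j_any_direction(y_true, x):
    """Hypothetical helper: return (threshold, rule), where rule is '>=' or '<'."""
    fpr, tpr, thresholds = roc_curve(y_true, x)
    j_scores = tpr - fpr  # J of the rule "x >= threshold is positive"
    # The complementary rule "x < threshold is positive" has J = -j_scores,
    # so the largest |J| is the best cut-off over both directions.
    best = np.argmax(np.abs(j_scores))
    rule = ">=" if j_scores[best] >= 0 else "<"
    return thresholds[best], rule

X = np.arange(1, 10)
Y = X < 5
T, rule = cutoff_youdens_j_any_direction(Y, X)
print(T, rule)  # threshold 5 with rule "<": predict positive when X < 5
```

This is equivalent to flipping the class or negating X when the AUC is below 0.5, but it also reports whether the threshold acts as a lower bound (">=") or an upper bound ("<") for the positive class. If X carries no information at all, every J is 0 and the returned threshold is meaningless.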
Upvotes: 2