Reputation: 1333
My classifier produces soft classifications and I wish to select an optimal threshold (that is, one that maximizes accuracy) from the method's results on the training cases, and then use that threshold to produce the hard classifications. While the problem itself is relatively easy, I find it hard to optimise the code so that the computation does not take forever. Below you'll find code that essentially recreates the optimisation procedure on some dummy data. Could you please point me in any direction that could possibly improve performance?
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = np.random.rand(400000)             # soft scores in [0, 1)
y_true = np.random.randint(2, size=400000)  # binary ground truth
# One accuracy_score call per distinct score -- this is the slow part
accs = [(accuracy_score(y_true, y_pred > t), t) for t in np.unique(y_pred)]
train_acc, train_thresh = max(accs, key=lambda pair: pair[0])
I realize that I could sort both y_pred and y_true prior to the loop, and use that to my advantage when binarizing y_pred, but that didn't bring much improvement (unless I did something wrong).
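Roughly, the sorted variant I tried looks like this (simplified; with the scores sorted, y_pred > t becomes a block of ones at the top, but there is still one accuracy_score call per threshold):

order = np.argsort(y_pred)
y_pred_s, y_true_s = y_pred[order], y_true[order]
accs = []
for i, t in enumerate(y_pred_s):
    # After sorting, y_pred_s > t is simply everything past position i
    hard = np.zeros_like(y_true_s)
    hard[i + 1:] = 1
    accs.append((accuracy_score(y_true_s, hard), t))
train_acc, train_thresh = max(accs, key=lambda pair: pair[0])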
Any help would be much appreciated.
Upvotes: 2
Views: 2987
Reputation: 419
I wrote a helper function in Python:
def opt_threshold_acc(y_true, y_pred):
    # Pair each label with its score and sort by score, ascending
    A = sorted(zip(y_true, y_pred), key=lambda x: x[1])
    total = len(A)
    # Threshold below the lowest score: every case is predicted positive
    tp = sum(1 for label, _ in A if label == 1)
    tn = 0
    th_acc = []
    # Sweep the threshold upward; each step moves one case below it
    for label, th in A:
        if label == 1:
            tp -= 1  # a true positive is lost
        else:
            tn += 1  # a true negative is gained
        acc = (tp + tn) / total
        th_acc.append((th, acc))
    # Best (threshold, accuracy) pair
    return max(th_acc, key=lambda x: x[1])
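For example, on dummy data like in the question (a minimal usage sketch, assuming the function above is in scope):

import numpy as np

y_pred = np.random.rand(400000)
y_true = np.random.randint(2, size=400000)

# One sort plus one linear sweep instead of a per-threshold accuracy_score call
best_th, best_acc = opt_threshold_acc(y_true, y_pred)
print(best_th, best_acc)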
Upvotes: 0
Reputation: 18379
Sort y_pred in descending order and use Kadane's Algorithm to calculate an index i such that the prefix of y_true from 0 to i has maximum sum, with the labels mapped to +1/-1 (each positive above the threshold adds one correct prediction, each negative above it costs one; with raw 0/1 labels the prefix sum would trivially be maximised by the whole array). Your optimal threshold b is then b = (y_pred[i] + y_pred[i+1]) / 2. This will be the same output that an SVM would give you, that is, the hyperplane (or for your 1-dimensional case, a threshold) that maximizes the margin between the classes.
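A minimal sketch of that idea in NumPy (find_threshold is just a working name; this assumes distinct scores and that at least one case ends up predicted positive):

import numpy as np

def find_threshold(y_true, y_pred):
    # Sort scores in descending order and reorder the labels to match
    order = np.argsort(y_pred)[::-1]
    scores = y_pred[order]
    # Map labels to +1/-1: positives above the threshold help accuracy,
    # negatives above it hurt
    gains = np.where(y_true[order] == 1, 1, -1)
    # Index of the maximum-sum prefix: cases 0..i are predicted positive
    i = int(np.argmax(np.cumsum(gains)))
    if i + 1 < len(scores):
        # Halfway between the last score above the threshold and the next one
        return (scores[i] + scores[i + 1]) / 2
    return scores[i]  # every case ends up above the threshold

One sort plus one cumulative sum, so the whole search is O(n log n) even for 400k cases.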
Upvotes: 2