Reputation: 698
I am working on a multi-class classification problem, it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for would I go about incorporating this "Unknown" class?
I currently have:
svm = LinearSCV()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = [np.amax(x) for x in decision_probabilities]
I was planning to use the predict_proba
method from the CalibratedClassifierCV model, and for any max probabilities that were under a threshold (yet to be determined) I would instead classify that sample as "Unknown" instead of the class that the probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just create a threshold of say, 50%, my accuracy would go down significantly, I would have quite of bit of "Unknown" classes (loss), and still a bit of misclassifications (even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
Upvotes: 2
Views: 904
Reputation: 601
A few suggestions right off the bat:
Upvotes: 1