aggis

Reputation: 698

Augmenting a classification model to predict "Unknown" instead of a wrong classification

I am working on a multi-class classification problem with some class imbalance (100 classes, a handful of which have only 1 or 2 samples associated).

I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.

The problem is that the business incurs a monetary loss for every misclassified prediction: each misclassification costs us $1,000. A solution would be to classify a datapoint as "Unknown" instead of risking a misclassification; these unknowns could then be human-classified, which would cost roughly $10 per "Unknown" prediction. Clearly, this is cheaper than the $1,000 misclassification loss.
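For a rough sense of the trade-off (assuming the calibrated probabilities are reliable):

# cost of a wrong prediction vs. cost of a human review
cost_misclassification = 1000
cost_human_review = 10

# in expectation, routing a prediction to a human pays off whenever
# its probability of being wrong exceeds the cost ratio
break_even = cost_human_review / cost_misclassification  # 0.01, i.e. 1%

So, in expectation, any prediction the model is less than 99% sure about is cheaper to hand to a human.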

Any suggestions for how I would go about incorporating this "Unknown" class?

I currently have:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)

# fit model
clf.fit(X_train, y_train)

# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)

# get the confidence for the highest class
confidence = np.amax(decision_probabilities, axis=1)

I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any sample whose max probability fell under a threshold (yet to be determined), to classify it as "Unknown" instead of the class that probability is actually associated with.
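As a sketch, that post-processing step would look something like this (the 0.5 threshold is just a placeholder):

threshold = 0.5  # placeholder value, yet to be determined
predictions = clf.predict(X_test)
final_predictions = [
    pred if conf >= threshold else "Unknown"
    for pred, conf in zip(predictions, confidence)
]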

The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just set a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a bit of "Unknown" predictions (a loss), and still a fair number of misclassifications (an even bigger loss).

Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?

Any help would be greatly appreciated!

Upvotes: 2

Views: 904

Answers (1)

artemis

Reputation: 601

A few suggestions right off the bat:

  1. Accuracy is not the right metric for evaluating imbalanced datasets. For example, if 90% of the samples belong to one class, a dumb model that always predicts the majority class achieves 90% accuracy. Precision and recall are generally better metrics for such cases; choosing between the two is generally a business decision.
  2. Given the input signals, it may be difficult to do better than 98%, especially since some classes have too few samples. What you can do is group the minority classes together and give them a single label, e.g. 'other' (see the first sketch after this list). This way, the model will hopefully have enough samples to learn that these samples are different from all the other classes and will classify them as 'other'.
  3. Often when you try to replace a manual business process with ML, you generally do not completely remove human intervention. The goal is to use the model on the cases/classes/input space where it does well and use the manual process for the rest. One way to do this is via the 'other' label: once the model predicts 'other', a human classifies those samples manually. Another is to find a threshold on the predicted probability above which the model has high accuracy and sufficient population coverage (see the second sketch below). For example, let's say you have 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases; for everything else, the manual process is followed.
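A minimal sketch of the grouping idea in point 2, assuming y_train is a pandas Series of labels and that the cutoff of 5 samples is an arbitrary choice for illustration:

# merge classes with too few samples into a single 'other' label
min_samples = 5  # arbitrary cutoff for this illustration
counts = y_train.value_counts()
rare_classes = counts[counts < min_samples].index
y_train_grouped = y_train.where(~y_train.isin(rare_classes), "other")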
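And a sketch of the threshold search in point 3, reusing clf, decision_probabilities, and y_test from the question; the costs come from the question's numbers, while the candidate thresholds are assumptions, not recommendations:

import numpy as np

confidence = np.amax(decision_probabilities, axis=1)
predictions = clf.predict(X_test)

for threshold in np.arange(0.50, 1.00, 0.05):
    covered = confidence >= threshold  # samples the model keeps
    wrong = predictions[covered] != np.asarray(y_test)[covered]
    coverage = covered.mean()
    accuracy = 1 - wrong.mean()  # accuracy on the kept samples only
    # expected cost: $1,000 per wrong kept prediction, $10 per deferred sample
    cost = 1000 * wrong.sum() + 10 * (~covered).sum()
    print(f"threshold={threshold:.2f}  coverage={coverage:.1%}  "
          f"accuracy={accuracy:.1%}  cost=${cost}")

Then pick the threshold with the lowest total cost (or the smallest one that meets your accuracy target), and route everything below it to the manual process.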

Upvotes: 1
