vojta

Reputation: 122

Can I make decision tree more sensitive to false negative?

I have the following question. I am predicting data using the DecisionTreeClassifier from the sklearn package. My confusion matrix looks like this:

[Image: confusion matrix of the classifier's predictions on the test set]

I don't care so much about the total accuracy, but I need to predict correctly when the dependent variable is 0. In other words, I want to reduce the false negative rate, and I am OK with the fact that the false positive rate will also increase. Is there a way to do this in Python?

My code has this structure:

from sklearn import tree
from sklearn.metrics import plot_confusion_matrix

clf = tree.DecisionTreeClassifier(max_depth=30)
clf.fit(X_train, y_train)

plot_confusion_matrix(clf, X_test, y_test)

Thanks for any hint.

Upvotes: 1

Views: 1717

Answers (2)

Elj

Reputation: 617

Your classes are imbalanced, which means that one of the classes has far more samples than the other. For your use case (where not all mistakes are equal, and some are more serious than others), you can take a look at Cost-Sensitive Learning for Imbalanced Classification.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. It is a field of study that is closely related to the field of imbalanced learning that is concerned with classification on datasets with a skewed class distribution. As such, many conceptualizations and techniques developed and used for cost-sensitive learning can be adopted for imbalanced classification problems.

The scikit-learn Python machine learning library provides examples of these cost-sensitive extensions via the class_weight argument on the following classifiers: SVC, DecisionTreeClassifier

Hence, you can try looking into that parameter for a start.
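For example, here is a minimal sketch of how class_weight could be applied to the question's setup. The weight of 10 on class 0 is an illustrative value, not a recommendation; tune it on a validation set:

from sklearn import tree

# Penalize mistakes on class 0 ten times more heavily than on class 1.
# X_train and y_train are the same training data as in the question.
clf = tree.DecisionTreeClassifier(max_depth=30, class_weight={0: 10, 1: 1})
clf.fit(X_train, y_train)

You can also pass class_weight="balanced", which weights classes inversely proportional to their frequencies in y_train.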

Upvotes: 0

DerekG

Reputation: 3958

I'll suggest two possible solutions to this problem.

  1. Without refitting the decision tree, you can look at the predicted probabilities for each class. These probabilities sum to 1, and by default the class with the highest probability is selected. You can introduce a bias term so that the class of interest is predicted whenever its probability is 0.5 - bias or higher, with a larger bias term meaning this class is predicted more frequently. Note that this will likely both increase false positives and decrease total aggregate accuracy. You can access the predicted probabilities with clf.predict_proba() (see the first sketch after this list).

  2. Refit the decision tree iteratively: select the examples that are misclassified as false negatives and train a new decision tree on the original data, but with these false negative examples given extra weight (one trivial way to do this within the sklearn framework is simply to repeat these examples multiple times). You can repeat this training and data weighting for several iterations, though eventually it may result in some strange overfitting. Note that this is roughly the strategy used by the AdaBoost algorithm, except that AdaBoost trains a new decision tree at each step and aggregates ALL trained trees at the end to make class predictions. You could also simply use class-based weighting, though this would be less sensitive to which examples are hard to classify (see the second sketch after this list).
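Here is a minimal sketch of the first approach, assuming the labels are 0/1 and X_test is as in the question; the bias value is illustrative and should be tuned:

import numpy as np

# Column order of predict_proba follows clf.classes_; with labels 0 and 1,
# column 0 holds the predicted probability of class 0.
proba_class0 = clf.predict_proba(X_test)[:, 0]

# With two classes, a 0.5 threshold reproduces the default argmax prediction.
# Subtracting a bias predicts class 0 more often, trading extra false
# positives for fewer false negatives.
bias = 0.2  # illustrative value; tune on validation data
y_pred = np.where(proba_class0 >= 0.5 - bias, 0, 1)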
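And a sketch of the second approach. Instead of literally repeating rows, this uses the sample_weight argument of fit, which has the same effect; the number of rounds and the weight multiplier are illustrative, and X_train / y_train are assumed to be NumPy arrays with class 0 as the class of interest:

import numpy as np
from sklearn import tree

weights = np.ones(len(y_train))  # start with uniform sample weights

for _ in range(3):  # a few rounds; too many may cause strange overfitting
    clf = tree.DecisionTreeClassifier(max_depth=30)
    clf.fit(X_train, y_train, sample_weight=weights)

    # Training examples whose true class is 0 but were predicted as 1
    # (the false negatives from the question's point of view).
    missed = (y_train == 0) & (clf.predict(X_train) == 1)

    # Upweight the missed examples; equivalent to duplicating those rows.
    weights[missed] *= 2.0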

As a final aside, I'd say to be careful in this case. You want to reduce the number of FN, increasing the number of FP if necessary. Your data is already imbalanced, with a disproportionately high number of positive examples (3384 + 455) versus negative examples (102 + 426). Consider, for instance, that you could train a naive decision tree that always predicts class 1. Its overall accuracy would be (3384 + 455) / (3384 + 455 + 102 + 426) ≈ 0.879, versus 0.798 in your current case.

Upvotes: 1
