Reputation: 31
I'm trying to solve a binary classification problem where 80% of the data belongs to class x and 20% belongs to class y. All my models (AdaBoost, neural networks and SVC) just predict that all data is part of class x, as this gives them the highest accuracy.
My goal is to achieve a higher precision for class x, and I don't care how many entries are falsely classified as class y.
My idea would be to just put entries in class x when the model is super sure about them and put them in class y otherwise.
How would I achieve this? Is there a way to move the threshold so that only very obvious entries are classified as class x?
I'm using Python and sklearn.
Sample Code:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix

adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_prediction = adaboost.predict(X_test)

confusion_matrix(adaboost_prediction, y_test) outputs:

array([[    0,     0],
       [10845, 51591]])
Upvotes: 3
Views: 98
Reputation: 27197
Using AdaBoostClassifier you can output class probabilities and then threshold them yourself, by using predict_proba instead of predict:
adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)

# one probability column per class, ordered as in adaboost.classes_
adaboost_probs = adaboost.predict_proba(X_test)

threshold = 0.8  # for example
# keep class x only when the model is very confident about it
# (use the column of adaboost_probs that corresponds to class x in adaboost.classes_)
thresholded_adaboost_prediction = adaboost_probs[:, 1] > threshold
Using this approach you could also inspect (just debug print, or maybe sort and plot on a graph) how the confidence levels vary in your final model on the test data, to help decide whether the approach is worth taking further.
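For example, a rough sketch of that kind of inspection, assuming the fitted adaboost model and X_test from above (the percentiles are just an illustrative choice):

import numpy as np

# distribution of the model's confidence in class x over the test set
# (take the predict_proba column that corresponds to class x)
probs_x = adaboost.predict_proba(X_test)[:, 1]

for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"{int(q * 100)}th percentile of P(class x): {np.quantile(probs_x, q):.3f}")

# or plot the sorted probabilities to see where a sensible threshold might sit:
# import matplotlib.pyplot as plt
# plt.plot(np.sort(probs_x)); plt.show()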
There is more than one way to approach your problem though. For example, see Miriam Farber's answer, which looks at re-weighting the classifier to adjust for your 80/20 class imbalance during training. You might find you have other problems too, including that the classifiers you are using perhaps cannot realistically separate classes x and y given your current data. Working through all the possibilities of a data problem like this can take a few different approaches.
If you have more questions about issues with your data problem as opposed to the code, there are Stack Exchange sites that could help you as well as Stack Overflow (do read the site guidelines before posting): Data Science and Cross Validated.
Upvotes: 4
Reputation: 19634
In SVM, one way to move the threshold is to choose class_weight in such a way that you put much more weight on data points from class y. Consider the example below, taken from SVM: Separating hyperplane for unbalanced classes:

[figure from that example: the two decision boundaries plotted over the unbalanced data]

The straight line is the decision boundary that you get when you use SVC with default class weights (the same weight for every class). The dashed line is the decision boundary that you get when you use class_weight={1: 10} (that is, put much more weight on class 1, relative to class 0).
Class weights basically adjust the penalty parameter C in SVM:
class_weight : {dict, ‘balanced’}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
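In code, a minimal sketch, assuming the labels in y_train are literally 'x' and 'y' (the weight of 10 on class y is just an illustrative value, not something tuned):

from sklearn.svm import SVC

# penalize misclassified class-y points 10x more than class-x points,
# which moves the decision boundary in favour of class y
svc = SVC(class_weight={'y': 10})
svc.fit(X_train, y_train)
svc_prediction = svc.predict(X_test)

# class_weight='balanced' would instead weight classes inversely to their frequency:
# svc_balanced = SVC(class_weight='balanced')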
Upvotes: 2