Reputation: 6543
What is the best way to deal with an imbalanced test set in scikit-learn?
My training data is split 70/30 between two classes, whereas the out-of-sample data is likely to be more like 90/10. I'm using random forest, logistic regression, and gradient boosting for classification, and I care about the probability output.
Upvotes: 1
Views: 3924
Reputation: 363547
If you use logistic regression, you can try the following:

Pass class_weight="auto" to the LogisticRegression constructor. You may also want to set intercept_scaling=1e3 (or some other large value); see the docstring for details. (Edit: as of scikit-learn 0.17, this option is spelled class_weight="balanced".)
Change the intercept of the model. class_weight should have made sure that you got the intercept (log-odds prior) for a 50/50 split, which can be turned into one for a 90/10 split by adding the new prior's log-odds and subtracting the old one's:

    old_prior, new_prior = .5, .9
    lr.intercept_ += np.log(new_prior / (1 - new_prior)) - np.log(old_prior / (1 - old_prior))

This mathematical trick is common in epidemiology (or so I've been told), where you often have a set of n_positive cases and a very small prior probability of disease, but getting a control group of actual size (1 - prior) / prior * n_positive is prohibitively expensive.
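Putting the two suggestions together, a minimal sketch, assuming X_train, y_train and X_test exist and that the question's rough 0.9 deployment prior refers to the positive class:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Fit as if the split were 50/50 (class_weight="balanced" is the modern
    # spelling of class_weight="auto"); intercept_scaling only matters for liblinear.
    lr = LogisticRegression(class_weight="balanced", solver="liblinear",
                            intercept_scaling=1e3)
    lr.fit(X_train, y_train)

    # Shift the balanced (50/50) intercept to the expected 90/10 deployment prior.
    old_prior, new_prior = 0.5, 0.9
    lr.intercept_ += np.log(new_prior / (1 - new_prior)) - np.log(old_prior / (1 - old_prior))

    # Probabilities now reflect the 90/10 prior.
    proba = lr.predict_proba(X_test)[:, 1]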
Similar tricks can be played with other probability models by multiplying the prior into their outputs, rather than folding it into the model directly. Naive Bayes (not a good probability model, but I'll mention it anyway) actually takes an optional class_prior argument.
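Along those lines, here is a sketch of folding a new prior into any classifier's probability output after the fact; clf, X_test and the 70/30 vs. 90/10 numbers are illustrative assumptions (the question doesn't say which class is the majority):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def shift_prior(p_pos, train_prior, deploy_prior):
        """Rescale P(y=1 | x) from the training class balance to the
        deployment class balance, then renormalise."""
        pos = p_pos * deploy_prior / train_prior
        neg = (1.0 - p_pos) * (1.0 - deploy_prior) / (1.0 - train_prior)
        return pos / (pos + neg)

    # e.g. trained on a 70/30 split, deployed on roughly 90/10
    adjusted = shift_prior(clf.predict_proba(X_test)[:, 1], 0.7, 0.9)

    # Naive Bayes can instead take the prior directly:
    nb = MultinomialNB(class_prior=[0.1, 0.9])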
Upvotes: 4
Reputation: 763
The scikit-learn package has some built-in machinery for dealing with class imbalance. For example, sklearn.model_selection.GridSearchCV by default uses this splitting mechanism: "For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used", and "The folds are made by preserving the percentage of samples for each class." So when you cross-validate with GridSearchCV, each fold keeps the same class proportions as the full data set. Maybe this helps you somehow.
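A small sketch of that behaviour; the random forest and the parameter grid are illustrative choices, not something from the question:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

    # An integer cv with a classifier already means StratifiedKFold;
    # passing it explicitly just makes the stratification visible.
    search = GridSearchCV(RandomForestClassifier(), param_grid,
                          cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
    search.fit(X_train, y_train)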
Upvotes: 0
Reputation: 61
Trevor Hastie's book The Elements of Statistical Learning (free PDF!) describes gradient boosting and is a good reference if that's your method of getting a probabilistic output. As with pretty much any ML method, you should look at appropriate regularization and shrinkage to correct for overfitting and bias.
Logistic regression, as mentioned here, provides some techniques to correct for sample class sizes. A nice thing about LR is that it is relatively well behaved with imbalanced class sizes. If you are working with huge amounts of data, then a log-linear stochastic gradient descent works pretty well. A rule of thumb of mine is that, when possible, I like to check an idea against an old-fashioned LR or Naive Bayes -- LR is about the simplest Markov model you can have and NB is about the simplest Bayes net you could have. Often a correctly tuned LR model scales well and can give you what you really want.
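One way to get such a log-linear SGD baseline in scikit-learn is SGDClassifier with a logistic loss; this is a sketch under that assumption (the paragraph above doesn't name a specific implementation):

    from sklearn.linear_model import SGDClassifier

    # Logistic loss + stochastic gradient descent = an online log-linear model
    # that scales to large data and still exposes predict_proba.
    sgd = SGDClassifier(loss="log_loss",   # loss="log" in older scikit-learn versions
                        penalty="l2", alpha=1e-4,
                        class_weight="balanced")
    sgd.fit(X_train, y_train)
    proba = sgd.predict_proba(X_test)[:, 1]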
As for metrics, an ROC curve measures ranking ability, which doesn't tell you how well your probabilities are calibrated. There's an ICML paper called Brier Curves that gives you the information in the ROC curve as well as meaningful data on how well your probabilities are calibrated. Or, to keep it simple, chart something like balanced accuracy against your prediction scores to see how things are mapping, alongside an ROC chart, and you will probably have a good idea of how your metrics work out.
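A sketch of a simple calibration check to sit alongside the ROC chart, assuming held-out labels y_val and predicted probabilities p_val (placeholder names):

    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss

    # Brier score: mean squared error of the probabilities (lower is better).
    print("Brier score:", brier_score_loss(y_val, p_val))

    # Reliability diagram data: observed frequency vs. mean predicted probability per bin.
    frac_positive, mean_predicted = calibration_curve(y_val, p_val, n_bins=10)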
Of course the key issue with all of this is data: keep your validation and modeling sets separate, etc. Good data hygiene is really central, and I think more than anything that is the core of your question: 70/30 vs 90/10. I run into a similar problem where our internal corpora are highly biased. It really comes back to using expert opinion and studying whether the system overfits when faced with real data, or whether you need to adjust the data a bit to be more realistic. Are you more concerned with false positives or coverage? Answering your first question really comes down to the business context of what you are trying to do: prediction, classification, make money, do homework.
You may want to recalibrate your probabilities. If you are feeding the probability output to another ML system, I wouldn't necessarily worry so much about recalibration, but if it's used somewhere where you really expect a probability output, look at maybe a beta-curve correction of some sort or something like isotonic regression.
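A sketch of the isotonic option using scikit-learn's CalibratedClassifierCV; the gradient-boosting base model and cv=5 are illustrative assumptions:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import GradientBoostingClassifier

    # Refit on CV folds and learn an isotonic mapping from raw scores to
    # calibrated probabilities (method="sigmoid" is the Platt-scaling alternative).
    calibrated = CalibratedClassifierCV(GradientBoostingClassifier(),
                                        method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]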
I've written a lot but answered little. My trite answer would be to work from some of the excellent examples and bake your solution off against a gradient-descent (log-linear) example or the LogisticRegression class. For your validation you want a metric that includes both probability calibration and ranking: I would generate both AUC and something like a deviance against your sample probabilities. That's a start at least. Study your data and see whether, at the end, you are satisfied you are going in the right direction.
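A sketch of that pairing of a ranking metric with a deviance-style one, again assuming the y_val and p_val placeholders:

    from sklearn.metrics import log_loss, roc_auc_score

    auc = roc_auc_score(y_val, p_val)        # ranking quality
    deviance = 2 * log_loss(y_val, p_val)    # mean binomial deviance = 2 x mean negative log-likelihood
    print(f"AUC = {auc:.3f}, mean deviance = {deviance:.3f}")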
I hope that's helpful.
Upvotes: 2
Reputation: 17015
For an imbalanced dataset, model evaluation should be done using the area under the ROC curve. The AUC score in sklearn can be found using metrics.roc_auc_score(). AUC can sometimes fail to give a proper evaluation of the model, so you should also consider a calibration curve alongside the AUC score (if required).
Upvotes: 0