Reputation: 723
I am working on a binary classification problem with a dataset that has extreme class imbalance. To help the model learn the signal of the minority class, I downsampled the majority class so that the training set is 20% minority class and 80% majority class.
Now there is one other parameter, scale_pos_weight. I am not sure how to set this parameter after downsampling.
Should I set it based on the actual class ratios, or should I use the class ratios after downsampling?
Upvotes: 3
Views: 4569
Reputation: 7281
Good question. XGBoost has been known to do well on imbalanced datasets, and it includes a number of hyperparameters to help us get there.
For the scale_pos_weight parameter, the XGBoost documentation suggests:
sum(negative instances) / sum(positive instances)
For extremely unbalanced datasets, some have suggested using the square root of the formula above.
For per-example weights, typically passed via the sample_weight parameter in XGBoost, you can compute balanced class weights with scikit-learn's class_weight utility, as described here.
The difference between the two is explored here, but in summary:
The sample_weight parameter allows you to specify a different weight for each training example. The scale_pos_weight parameter lets you provide a weight for an entire class of examples ("positive" class).
In the code below, you can see these implementations, including the square-root variant. Please note that I had to use synthetic data, since none was provided in the question.
# General imports
import pandas as pd
from sklearn import datasets
from collections import Counter
# Generate datasets
from sklearn.datasets import make_classification
from imblearn.datasets import make_imbalance
# Train, test, splits and gridsearch optimization
from sklearn.model_selection import train_test_split, GridSearchCV
# Class weights
from sklearn.utils import class_weight
# Performance
from sklearn.metrics import classification_report
# Modeling
import xgboost
import warnings
warnings.filterwarnings('ignore')
# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, class_sep=2.0, n_classes=2, n_clusters_per_class=5, hypercube=True, random_state=30)
scaled_X, scaled_y = make_imbalance(X, y, sampling_strategy={0:200}, random_state=8)
data = pd.DataFrame(data=scaled_X, columns=['feature_{}'.format(i) for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(data, scaled_y, random_state=8, stratify=scaled_y)
# Compare four XGBoost models: no weighting, sample weights, scale_pos_weight, and sqrt(scale_pos_weight)
# Build a model without any weighting, fit it, and get a set of its performance measures.
model_no_scale = xgboost.XGBClassifier(random_state=30)
model_no_scale.fit(X_train, y_train)
# Print performance
print("Off the Shelf XGBoost")
print(classification_report(y_test, model_no_scale.predict(X_test)))
# Get class_weights
# https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
# sample_weight must be passed to fit() (not the constructor) and computed on y_train so its length matches the training set
model_weights = xgboost.XGBClassifier(random_state=30)
model_weights.fit(X_train, y_train, sample_weight=class_weight.compute_sample_weight(class_weight='balanced', y=y_train))
# Print performance
print("Weights XGBoost")
print(classification_report(y_test, model_weights.predict(X_test)))
# Get the counts of the training data per XGBoost documentation
counts = Counter(y_train)
model_scale = xgboost.XGBClassifier(scale_pos_weight=counts[0] / counts[1], random_state=30)
model_scale.fit(X_train, y_train)
# Print performance
print("Scale XGBoost")
print(classification_report(y_test, model_scale.predict(X_test)))
# Some suggest the square root of that ratio for extremely unbalanced data
from math import sqrt
model_sqrt = xgboost.XGBClassifier(scale_pos_weight=sqrt(counts[0] / counts[1]), random_state=30)
model_sqrt.fit(X_train, y_train)
# Print performance
print("SQRT XGBoost")
print(classification_report(y_test, model_sqrt.predict(X_test)))
Results in:
Off the Shelf XGBoost
precision recall f1-score support
0 0.95 0.38 0.54 50
1 0.98 1.00 0.99 1253
accuracy 0.98 1303
macro avg 0.96 0.69 0.77 1303
weighted avg 0.97 0.98 0.97 1303
Weights XGBoost
precision recall f1-score support
0 0.95 0.38 0.54 50
1 0.98 1.00 0.99 1253
accuracy 0.98 1303
macro avg 0.96 0.69 0.77 1303
weighted avg 0.97 0.98 0.97 1303
Scale XGBoost
precision recall f1-score support
0 0.73 0.64 0.68 50
1 0.99 0.99 0.99 1253
accuracy 0.98 1303
macro avg 0.86 0.82 0.83 1303
weighted avg 0.98 0.98 0.98 1303
SQRT XGBoost
precision recall f1-score support
0 0.96 0.46 0.62 50
1 0.98 1.00 0.99 1253
accuracy 0.98 1303
macro avg 0.97 0.73 0.81 1303
weighted avg 0.98 0.98 0.97 1303
Upvotes: 1
Reputation: 2348
The class weights are used when computing the loss function to prevent the model from giving too much importance to the majority class. If one class dominates the dataset, the model will be biased toward learning that class better, because the loss is mainly determined by the model's performance on that dominant class.
Let's consider an extreme case where the dataset contains 99 percent positive samples. If a model just predicts 1 for every sample, it will reach 99 percent accuracy. The idea behind class weights is that you want every sample to contribute to the loss equally. Therefore, you should compute this ratio based on your training set, because the loss is computed on your training set; your model has no idea about the samples that you dropped.
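As a quick illustration, here is a minimal sketch with made-up labels (the 99/1 split is hypothetical) showing the accuracy trap:
# A constant "always predict 1" classifier on a 99%-positive dataset
import numpy as np
from sklearn.metrics import accuracy_score
y_true = np.array([1] * 99 + [0])      # 99 positive samples, 1 negative
y_pred = np.ones_like(y_true)          # predict the positive class for everything
print(accuracy_score(y_true, y_pred))  # 0.99, despite learning nothing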
Intuitively, a misclassified sample contributes a large loss while a correctly classified one contributes almost none. Coming to your case: to make sure that every sample contributes to the loss equally, a false prediction on the minority class should be penalized 4 times more than a false prediction on the majority class, so that the model cannot ignore a class or develop a bias towards the majority class.
It is generally a good idea to set each class weight inversely proportional to the number of samples in that class. In your case, that gives 4. In practice, however, you should try a few different values to find the best weight.
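For concreteness, a minimal sketch (assuming y_train holds your downsampled 80/20 training labels, with the minority class labeled 1) of deriving that weight:
# Derive scale_pos_weight from the *training* class counts,
# since the loss is computed on the training set
from collections import Counter
counts = Counter(y_train)                  # e.g. {0: 8000, 1: 2000}
scale_pos_weight = counts[0] / counts[1]   # 8000 / 2000 = 4.0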
Another important aspect is the ratio of these classes in the wild. You said that you downsampled; if the class ratio in the wild differs from the one in your training set, you might observe worse scores when you deploy your model or test it on unseen samples. That is why you should ideally also split your validation and test sets with realistic ratios, using your domain knowledge.
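One way to do that is to split first and downsample only the training portion, so the held-out set keeps the realistic ratio. A minimal sketch, assuming X and y hold the full imbalanced dataset and 0 is the majority class:
# Split first, then downsample only the training fold
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# minority/majority = 0.25 after resampling, i.e. a 20%/80% training split
rus = RandomUnderSampler(sampling_strategy=0.25, random_state=0)
X_train_ds, y_train_ds = rus.fit_resample(X_train, y_train)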
Upvotes: 1
Reputation: 116
Since you've already downsampled the data, the scale_pos_weight parameter should be set according to your downsampled data. Calculate the value using:
scale_pos_weight = count(negative examples) / count(positive examples)
In your case,
scale_pos_weight = 80/20 = 4
You can also use hyperparameter optimization to find the best set of parameters automatically.
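For example, a minimal sketch (the candidate values are hypothetical, and X_train / y_train are assumed to be your downsampled training data) of grid-searching scale_pos_weight:
# Grid-search scale_pos_weight instead of fixing it at the 80/20 heuristic
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
param_grid = {'scale_pos_weight': [1, 2, 4, 8]}
search = GridSearchCV(XGBClassifier(random_state=0), param_grid,
                      scoring='f1',  # accuracy is misleading under imbalance
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_)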
Upvotes: 0