Reputation: 153
I know that you can set scale_pos_weight for an imbalanced binary dataset. However, how do I deal with a multi-class classification problem on an imbalanced dataset? I have gone through https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost/18823 but don't quite understand how to set the weight parameter in DMatrix.
Can anyone please explain in detail?
Upvotes: 5
Views: 12697
Reputation: 153
The XGBClassifier constructor doesn't support a weights parameter. Please use sample_weight in fit() instead: fit(X, y, sample_weight).
For details, please see the XGBoost Python API documentation.
If you want to set weights directly in the constructor, use another algorithm such as random forest or LightGBM.
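To make the fit() suggestion concrete, here is a minimal sketch. The helper names balanced_sample_weights and fit_weighted are my own, not part of XGBoost, and the "balanced" formula n_samples / (n_classes * class_count) is just one common choice of weighting, so treat it as an assumption rather than the required scheme.

```python
from collections import Counter

def balanced_sample_weights(y):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in y]

def fit_weighted(X, y):
    """Hypothetical helper: train XGBClassifier with per-sample weights."""
    # Imported here so the weight helper above works without xgboost installed.
    import numpy as np
    from xgboost import XGBClassifier
    model = XGBClassifier(n_estimators=100)
    # The weights go to fit(), not to the constructor.
    model.fit(X, y, sample_weight=np.asarray(balanced_sample_weights(y)))
    return model

# Toy labels: class 0 is the majority, class 2 the rarest.
y = [0, 0, 0, 0, 1, 1, 2]
w = balanced_sample_weights(y)
```

With these weights every class contributes the same total weight, so the rare classes are not drowned out by the majority class.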
Upvotes: 1
Reputation: 361
For an imbalanced dataset, I used per-sample weights in XGBoost, where the weights are an array with one entry per sample, assigned according to the class that sample belongs to.
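import numpy as np

def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    # Unique class labels (sorted) and the number of samples in each class
    classes, class_samples = np.unique(y_train, return_counts=True)
    total_samples = class_samples.sum()
    n_classes = len(classes)
    # Balanced weights: the rarer the class, the larger its weight
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}
    # Scale the weight of the most frequent class by the given coefficient
    largest_class = classes[np.argmax(class_samples)]
    class_weight_dict[largest_class] *= largest_class_weight_coef
    # One weight per training sample, looked up by its class label
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights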
Just pass the target column and the occurrence rate of the most frequent class (if the most frequent class has 75 out of 100 samples, then it's 0.75):
largest_class_weight_coef = max(df_copy['Category'].value_counts().values) / df.shape[0]
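# Pass y_train as a numpy array
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)
# And then use it like this. The weights go to fit() as sample_weight,
# not to the constructor. X_train is assumed to be the training feature matrix.
from xgboost import XGBClassifier
xg = XGBClassifier(n_estimators=1000, max_depth=20)
xg.fit(X_train, y_train, sample_weight=weight)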
That's it :)
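Since the question asks about DMatrix specifically, here is a rough sketch of the same idea with the native API, where the per-sample weights go into the weight argument of DMatrix. The helper names multiclass_params and train_native are hypothetical, the objective and round count are just example settings, and the labels are assumed to be integers from 0 to num_class - 1.

```python
def multiclass_params(y):
    """Example parameters; num_class is required for multi-class objectives."""
    return {"objective": "multi:softprob", "num_class": len(set(y))}

def train_native(X, y, sample_weights, num_boost_round=100):
    """Hypothetical helper: train with the native API and per-row weights."""
    # Imported locally so multiclass_params can be used without xgboost installed.
    import xgboost as xgb
    # One weight per row, aligned with the labels
    dtrain = xgb.DMatrix(X, label=y, weight=sample_weights)
    return xgb.train(multiclass_params(y), dtrain, num_boost_round=num_boost_round)

# Toy labels with three classes
params = multiclass_params([0, 1, 2, 2, 1, 0, 0])
```

The sample_weights argument can be the list returned by CreateBalancedSampleWeights, or any other array with one weight per row.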
Upvotes: 5