Spedo

Reputation: 365

Handling unbalanced data in GradientBoostingClassifier using class weights?

I have a very unbalanced dataset on which I need to build a model for a classification problem. The dataset has around 30,000 samples, of which around 1,000 are labelled as 1 and the rest as 0. I build the model with the following lines:

from sklearn.ensemble import GradientBoostingClassifier

X_train = training_set
y_train = target_value
my_classifier = GradientBoostingClassifier(loss='deviance', learning_rate=0.005)
my_model = my_classifier.fit(X_train, y_train)

Since this is unbalanced data, it is not correct to build the model simply as above, so I have tried to use class weights as follows:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

Now I have no idea how to use class_weights (which basically contains the values 0.5 and 9.1) to train and build the model using GradientBoostingClassifier.

Any idea how I can handle this unbalanced data with class weights or other techniques?

Upvotes: 5

Views: 6010

Answers (1)

MaximeKan

Reputation: 4211

You should be using sample weights instead of class weights. In other words, GradientBoostingClassifier lets you assign weights to each observation rather than to classes. This is how you can do it, supposing y = 0 corresponds to the weight 0.5 and y = 1 to the weight 9.1:

import numpy as np

# one weight per training sample, set according to that sample's class
sample_weights = np.zeros(len(y_train))
sample_weights[y_train == 0] = 0.5
sample_weights[y_train == 1] = 9.1

Then pass these weights to the fit method:

my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)
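
If you would rather not hard-code the two weights, scikit-learn can derive the same per-observation weights for you with sklearn.utils.class_weight.compute_sample_weight, which maps each sample to its class's 'balanced' weight in one call. A minimal sketch, assuming y_train is a NumPy array of 0/1 labels as in the question:

from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' gives each sample the weight n_samples / (n_classes * count(its class)),
# i.e. the same per-class values that compute_class_weight('balanced', ...) produces
sample_weights = compute_sample_weight('balanced', y_train)
my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)

Either way, the effect is the same: errors on the rare positive class are penalized roughly in proportion to how underrepresented it is, so the boosting stages do not simply ignore it.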

Upvotes: 5
