Output of shape for training after oversampling with imbalanced-learn

Question

I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method. This code works nicely:

from imblearn.over_sampling import SMOTE
from collections import Counter

def oversample(x_values, y_values):
    oversampler = SMOTE(random_state=42, n_jobs=-1)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    print("Oversampling training set from {0} to {1} using {2}".format(dict(Counter(y_values)), dict(Counter(y_oversampled)), type(oversampler).__name__))
    return x_oversampled, y_oversampled

But I switched to a pipeline so that I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). As a result, I never call fit_resample myself, and I lose my output with something like this:

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)

The oversampling works, but I no longer see how many entries of each class there are in the training set.

Is there a way of getting output similar to the first example when using a pipeline?

Georgios Douzas · Accepted Answer

In theory, yes. When an over-sampler is fitted, an attribute sampling_strategy_ is created. It is a dictionary that maps each minority class label to the number of samples to be generated for that class when fit_resample is invoked. You can use it to get output similar to your example above. Here is a modified example based on your code:

# Imports
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE    
from imblearn.pipeline import Pipeline

# Create toy dataset
X, y = make_classification(weights=[0.20, 0.80], random_state=0)
init_class_distribution = Counter(y)
min_class_label, _ = init_class_distribution.most_common()[-1]
print(f'Initial class distribution: {dict(init_class_distribution)}')

# Create and fit pipeline
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier(random_state=23))])
pipe.fit(X, y)
sampling_strategy = dict(pipe.steps).get('sampler').sampling_strategy_
expected_n_samples = sampling_strategy.get(min_class_label)
print(f'Expected number of generated samples: {expected_n_samples}')

# Fit and resample over-sampler pipeline
sampler_pipe = Pipeline(pipe.steps[:-1])
X_res, y_res = sampler_pipe.fit_resample(X, y)
actual_class_distribution = Counter(y_res)
print(f'Actual class distribution: {actual_class_distribution}')
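
To get from the generated counts back to a per-class summary like the one in the question, the original counts and the generated counts can simply be added. The following is a minimal sketch using only the standard library; the helper name resampled_distribution and the hand-written n_generated dictionary are illustrative assumptions, with n_generated standing in for a fitted sampler's sampling_strategy_.

```python
from collections import Counter


def resampled_distribution(y, n_generated):
    """Return the class counts expected after over-sampling.

    y: original training labels.
    n_generated: dict mapping class label -> number of samples to generate,
        assumed to have the same form as a fitted sampler's sampling_strategy_.
    """
    counts = Counter(y)
    # Counter.update adds the generated counts to the original ones
    counts.update(n_generated)
    return dict(counts)


# Hypothetical labels: 20 samples of class 0 and 80 of class 1
y = [0] * 20 + [1] * 80
# Hand-written stand-in for sampling_strategy_ (not taken from a real fit)
n_generated = {0: 60}

print(f'Expected class distribution: {resampled_distribution(y, n_generated)}')
```

With the fitted pipeline above, the second argument would be the sampling_strategy_ dictionary read from the 'sampler' step. This gives the expected distribution only; the fit_resample step in the answer remains the way to check the actual one.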
