Björn

Reputation: 1832

Running sampling in scikit-learn with imblearn in parallel

I just noticed that the over-/under-sampling methods from the imbalanced-learn (imblearn) package now emit a FutureWarning when they are run in parallel via the n_jobs=x argument:

FutureWarning: The parameter n_jobs has been deprecated in 0.10 and will be removed in 0.12. You can pass an nearest neighbors estimator where n_jobs is already set instead

So instead of passing an int to n_jobs, we should now pass an instance of sklearn.neighbors.KNeighborsClassifier with n_jobs already set, as in the screenshot below? In this example that gives roughly a 10% speed-up. Is there anything else to consider here?

[Screenshot: notebook cells timing SMOTE with and without a preset KNeighborsClassifier]

MRE Code from the Notebook (screenshotted above)

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create an imbalanced sample dataset (99% majority / 1% minority class)
N = 500000
n_features = 200

X, y = make_classification(n_samples=N,
                           n_features=n_features,
                           n_clusters_per_class=1,
                           weights=[0.99],
                           flip_y=0,
                           random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

%%timeit -n 5
# Baseline: SMOTE with default settings (single-threaded neighbour search)
sampler = SMOTE()
global X_train, y_train
_, _ = sampler.fit_resample(X_train, y_train)

# Neighbour estimator with n_jobs already set
knn = KNeighborsClassifier(n_jobs=3)

%%timeit -n 5
# Variant: pass the preset estimator via k_neighbors instead of SMOTE(n_jobs=...)
global X_train, y_train, knn
sampler = SMOTE(k_neighbors=knn)
_, _ = sampler.fit_resample(X_train, y_train)
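
For reference, here is a minimal sketch of the same idea with an unsupervised sklearn.neighbors.NearestNeighbors instance instead of a classifier. It assumes SMOTE accepts any KNeighborsMixin-based estimator for k_neighbors, and that n_neighbors=6 mirrors the internal default (k_neighbors=5 plus the sample itself); both points are assumptions on my part, not something stated in the warning text.

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

# Sketch (assumption): an unsupervised neighbour search with n_jobs preset,
# passed via k_neighbors instead of the deprecated SMOTE(n_jobs=...).
nn = NearestNeighbors(n_neighbors=6, n_jobs=3)  # assumption: 6 = default k_neighbors=5 + the sample itself
sampler = SMOTE(k_neighbors=nn)
X_res, y_res = sampler.fit_resample(X_train, y_train)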

Upvotes: 1

Views: 329

Answers (0)
