Reputation: 2132
I want to run machine learning algorithms from the scikit-learn library on all my cores using the Dask and joblib libraries.
My code for the joblib.parallel_backend with Dask:
#Fire up the Joblib backend with Dask:
with joblib.parallel_backend('dask'):
    model_RFE = RFE(estimator = DecisionTreeClassifier(), n_features_to_select = 5)
    fit_RFE = model_RFE.fit(X_values,Y_values)
Unfortunately, when I look at my task manager I can see all my workers sitting idle, and only one new Python process is doing all the work:
Even in my Dask visualization on Client I see the workers doing nothing:
I would welcome any other ideas. My whole code attempt, following this tutorial from the docs:
import pandas as pd
import dask.dataframe as df
from dask.distributed import Client
import sklearn
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
import joblib
#Create cluster on local PC
client = Client(n_workers = 4, threads_per_worker = 1, memory_limit = '4GB')
client
#Read data from .csv
dataframe_lazy = df.read_csv(path, engine = 'c', low_memory = False)
dataframe = dataframe_lazy.compute()
#Get my X and Y values and release the original DF from memory
X_values = dataframe.drop(columns = ['Id', 'Target'])
Y_values = dataframe['Target']
del dataframe
#Prepare data
X_values.fillna(0, inplace = True)
#Fire up the Joblib backend with Dask:
with joblib.parallel_backend('dask'):
    model_RFE = RFE(estimator = DecisionTreeClassifier(), n_features_to_select = 5)
    fit_RFE = model_RFE.fit(X_values,Y_values)
Upvotes: 5
Views: 3394
Reputation: 409
The Dask joblib backend will not be able to parallelize all scikit-learn models, only some of them as indicated in the Parallelism docs. This is because many scikit-learn models only support sequential training either due to the algorithm implementations or because parallel support has not been added.
Dask will only be able to parallelize models that have an n_jobs parameter, which indicates that the scikit-learn model is written in a way that supports parallel training. RFE and DecisionTreeClassifier do not have an n_jobs parameter. I wrote this gist that you can run to get a full list of the models that support parallel training.
Upvotes: 11