Karl

Reputation: 1071

Using multiprocessing Pool with sklearn: code runs but cores don't show any work

I am trying to do some machine learning on about 31,000 rows and 1,000 columns. It takes a long time, so I thought I could parallelize the job. I wrapped the work in a function and tried to use Pool on Windows 10 in a Jupyter notebook. The code runs, but when I look at my cores in Task Manager they are not doing any work. Is there something wrong with the code, or is this just not supported?

from sklearn.model_selection import train_test_split
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import Imputer
from sklearn.metrics import accuracy_score
from multiprocessing import Pool
from datetime import datetime as dt

def tree_paralel(x):
    # Relies on the globals kfolds, im and start defined further down.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=x, random_state=1)
    accuracy_ = []
    for train_idx, val_idx in kfolds.split(X_dev, y_dev):
        X_train, y_train = X_dev.iloc[train_idx], y_dev.iloc[train_idx]
        X_val, y_val = X_dev.iloc[val_idx], y_dev.iloc[val_idx]

        # Fit the imputer on the training fold only, then apply it to the validation fold.
        X_train = pd.DataFrame(im.fit_transform(X_train), index=X_train.index)
        X_val = pd.DataFrame(im.transform(X_val), index=X_val.index)
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_val)
        accuracy_.append(accuracy_score(y_val, y_pred))
    print("This was the " + str(x) + " iteration", (dt.now() - start).total_seconds())
    return accuracy_

I then run it with a multiprocessing Pool:

kfolds = KFold(n_splits=10)
accuracy = []
im = Imputer()

p = Pool(5)

input_ = range(1,11)
output_ = []
start = dt.now()
for result in p.imap(tree_paralel, input_):
    output_.append(result)
p.close()
print("Time:", (dt.now() - start).total_seconds())

Upvotes: 3

Views: 1737

Answers (1)

Qusai Alothman

Reputation: 2072

That's a known issue when using Python interactively.
Quoting the note from the "Using a pool of workers" section of the multiprocessing documentation:

Note: Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.pool.Pool examples will not work in the interactive interpreter.

See also the multiprocessing Programming Guidelines.
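
The usual workaround is to define the worker function in a module (a .py file) and create the pool under an `if __name__ == "__main__":` guard, so the child processes can import the function by name. Here is a minimal sketch of that layout. The file name `tree_worker.py` and the functions `score_depth` and `run_all` are hypothetical, and `score_depth` is only a stand-in for the real cross-validation loop:

```python
# tree_worker.py (hypothetical file name); run with: python tree_worker.py
from multiprocessing import Pool


def score_depth(max_depth):
    """Stand-in for the per-depth work; replace with the real CV loop."""
    # CPU-bound busywork so the cores have something visible to do.
    total = sum(i * i for i in range(max_depth * 100_000))
    return max_depth, total % 97


def run_all(depths, processes=5):
    # Workers must be able to import score_depth, which is why it lives
    # at module level in a file rather than in a notebook cell.
    with Pool(processes) as pool:
        return pool.map(score_depth, depths)


if __name__ == "__main__":
    # The guard keeps child processes from re-running this block when
    # they import the module (needed on Windows, which spawns workers).
    for depth, value in run_all(range(1, 11)):
        print(depth, value)
```

From a notebook, importing the function (`from tree_worker import score_depth`) and passing the imported function to the pool should work on the same principle, though I have not tested that against your exact setup.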

By the way, I'm not sure what you are trying to accomplish with your code. Wouldn't using GridSearchCV with n_jobs=5 solve your issue (and greatly simplify your code)?
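
A rough sketch of what that could look like, assuming the goal is to compare `max_depth` values 1 to 10 by 10-fold accuracy. The synthetic data stands in for your `X_dev` / `y_dev`, the pipeline step names `impute` and `tree` are my own, and `SimpleImputer` from `sklearn.impute` is used because `Imputer` is no longer available in recent scikit-learn versions:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_dev / y_dev (assumption, not the real data).
X_dev, y_dev = make_classification(n_samples=500, n_features=20, random_state=1)

# Putting the imputer in the pipeline means it is refit on each training fold,
# mirroring the fit_transform / transform split in the manual loop.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("tree", DecisionTreeClassifier(criterion="gini", random_state=1)),
])

search = GridSearchCV(
    pipe,
    param_grid={"tree__max_depth": list(range(1, 11))},
    scoring="accuracy",
    cv=KFold(n_splits=10),
    n_jobs=5,  # run the fits in up to 5 parallel workers
)
search.fit(X_dev, y_dev)

# One mean accuracy per max_depth value, in grid order.
mean_accuracy = search.cv_results_["mean_test_score"]
print(search.best_params_)
```

The per-fold scores are also available in `search.cv_results_` (keys such as `split0_test_score`) if you need the full list the original function returned.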

Upvotes: 1
