Reputation: 1071
I am trying to do some machine learning on a dataset of about 31,000 rows and 1,000 columns. It takes a long time, so I thought I could parallelize the job. I wrapped it in a function and tried to run it with the multiprocessing tool on Windows 10 in a Jupyter notebook. But it just keeps running, and when I look at my cores in Task Manager they are not doing any work. Is there something wrong with the code, or is this just not supported?
from sklearn.model_selection import train_test_split
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import Imputer
from sklearn.metrics import accuracy_score
from multiprocessing import Pool
from datetime import datetime as dt
def tree_paralel(x):
    tree = DecisionTreeClassifier(criterion="gini", max_depth=x, random_state=1)
    accuracy_ = []
    for train_idx, val_idx in kfolds.split(X_dev, y_dev):
        X_train, y_train = X_dev.iloc[train_idx], y_dev.iloc[train_idx]
        X_val, y_val = X_dev.iloc[val_idx], y_dev.iloc[val_idx]
        X_train = pd.DataFrame(im.fit_transform(X_train), index=X_train.index)
        X_val = pd.DataFrame(im.transform(X_val), index=X_val.index)
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_val)
        accuracy_.append(accuracy_score(y_val, y_pred))
    print("This was the " + str(x) + " iteration", (dt.now() - start).total_seconds())
    return accuracy_
And then I call it with the multiprocessing tool:
kfolds = KFold(n_splits=10)
accuracy = []
im = Imputer()
p = Pool(5)
input_ = range(1,11)
output_ = []
start = dt.now()
for result in p.imap(tree_paralel, input_):
    output_.append(result)
p.close()
print("Time:", (dt.now() - start).total_seconds())
Upvotes: 3
Views: 1737
Reputation: 2072
That's a known limitation of multiprocessing when it is used from an interactive Python session, such as a Jupyter notebook.
Quoting the note from the "Using a pool of workers" section of the multiprocessing documentation:
Note: Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.pool.Pool examples will not work in the interactive interpreter.
See also the multiprocessing Programming Guidelines.
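To illustrate the note, here is a minimal sketch of the usual workaround: put the worker function in a real .py file and only create the pool under an `if __name__ == "__main__":` guard. The file name `tree_pool.py` and the functions `evaluate_depth` and `run_pool` are made up for this example, and `evaluate_depth` is only a CPU-bound stand-in for your cross-validation loop, not your actual code.

```python
# Hypothetical file: tree_pool.py (run with `python tree_pool.py`,
# or import evaluate_depth/run_pool from the notebook instead of defining them there).
from multiprocessing import Pool


def evaluate_depth(n):
    """Stand-in for the per-depth work; must live at module level so workers can import it."""
    return sum(i * i for i in range(n))


def run_pool(inputs, processes=2):
    """Map evaluate_depth over inputs using a pool of worker processes."""
    with Pool(processes) as pool:
        return pool.map(evaluate_depth, inputs)


if __name__ == "__main__":
    # The guard keeps child processes from re-running this block
    # when they import the module (required on Windows).
    print(run_pool([10, 100, 1000]))
```

Because the function is defined in an importable module rather than in a notebook cell, the child processes on Windows can find it when they start.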
By the way, I didn't quite get what you are trying to accomplish with your code. Wouldn't using GridSearchCV with n_jobs=5 solve your issue (and greatly simplify your code)?
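As a rough sketch of what I mean, assuming the goal is to pick the best max_depth by 10-fold cross-validation: the synthetic data from `make_classification` is only a placeholder for your X and y, and I used `SimpleImputer` from `sklearn.impute` because `Imputer` has been removed from recent scikit-learn releases. Putting the imputer in a `Pipeline` keeps it fitted on the training folds only, as in your loop.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): replace with your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Imputation and the tree are refitted together on each training fold.
pipeline = Pipeline([
    ("impute", SimpleImputer()),
    ("tree", DecisionTreeClassifier(criterion="gini", random_state=1)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"tree__max_depth": list(range(1, 11))},  # depths 1..10
    cv=KFold(n_splits=10),
    scoring="accuracy",
    n_jobs=5,  # run the fits in up to 5 parallel workers
)
search.fit(X_dev, y_dev)

print(search.best_params_, search.best_score_)
```

The per-depth mean accuracies are in `search.cv_results_["mean_test_score"]`, which replaces the hand-built `output_` list.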
Upvotes: 1