Reputation: 20137
I have a class that behaves strangely when it is accessed by multiple threads. The threads are started during sklearn's GridSearch training (with n_jobs=3), so I don't know exactly how they are called.
My class itself looks roughly like this:
    from sklearn.base import BaseEstimator, TransformerMixin
    import threading

    class FeatureExtractorBase(BaseEstimator, TransformerMixin):
        # Class-level state, meant to be shared by all instances
        expensive_dependency = {}
        lock = threading.Lock()
        is_loaded = False

        @staticmethod
        def load_dependencies():
            # Hold the lock so that only one caller performs the load
            with FeatureExtractorBase.lock:
                if not FeatureExtractorBase.is_loaded:
                    print('first request, start loading..')
                    # load dependencies, takes a while
                    FeatureExtractorBase.is_loaded = True
                    print('done')

    class ActualExtractor(FeatureExtractorBase):
        def transform(self, data):
            FeatureExtractorBase.load_dependencies()
            # generate features from data using dependencies
            return features
This class uses what I hoped would be lazy initialization. Initializing it right away led to problems, so I can't do that any more. And since the class gets reinitialized a couple of times during one program run, loading the data in the constructor each time would waste time. The problem is that this doesn't work the way I want it to. This is the output:
Starting training
Fitting 3 folds for each of 1 candidates, totalling 3 fits
first request, start loading..
first request, start loading..
first request, start loading..
done.
done.
done.
Done 1 jobs | elapsed: 56.2s
Done 3 out of 3 | elapsed: 1.0min finished
Starting evaluation
first request, start loading..
done.
Not only do three threads enter what I thought would be a locked region at the same time; one minute later, during evaluation, the same region is entered again, even though is_loaded should have been set to True at that point.
This is the first time I have worked with threads in Python, and I am still very clumsy with classes, so I am sure I am doing something wrong here. But I can't see what or where.
Upvotes: 2
Views: 191
Reputation: 1408
Because of the global interpreter lock (GIL), threading by itself does not speed up Python code whose bottleneck is the CPU rather than I/O (reads/writes). To actually get a speedup, sklearn uses multiprocessing for parallelization. This differs from threading in that objects are copied into separate processes, so there are effectively multiple copies of your class, each with its own lock and its own is_loaded flag. Each time a new process is started, your original, uninitialized class is copied, and that copy then does the loading again. Changes made inside a worker process are not propagated back to the parent, which is why the flag is still False there when evaluation starts.
I think this is the main parallelization module sklearn uses for grid searching: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/externals/joblib/parallel.py
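To make the effect easier to see, here is a minimal sketch using only the standard library. It is not what sklearn or joblib actually run. LazyResource, worker, and run_demo are hypothetical names standing in for your class, and the "fork" start method is an assumption, so this runs on POSIX systems only.

```python
import multiprocessing as mp
import os


class LazyResource:
    """Hypothetical stand-in for FeatureExtractorBase."""
    is_loaded = False  # class attribute: every process has its own copy

    @classmethod
    def load(cls):
        """Return True if this call performed the (pretend) expensive load."""
        if cls.is_loaded:
            return False
        cls.is_loaded = True
        return True


def worker(queue):
    # Runs in a child process and reports whether it had to load.
    queue.put((os.getpid(), LazyResource.load()))


def run_demo(n_workers=3):
    # Assumption: the "fork" start method, which is only available on POSIX.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(queue,)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    # The parent's flag is returned alongside the workers' reports.
    return results, LazyResource.is_loaded


if __name__ == "__main__":
    results, parent_flag = run_demo()
    print(results)      # every worker reports True: each loaded its own copy
    print(parent_flag)  # False: the parent's copy was never modified
```

Each worker sees is_loaded as False and loads on its own, and the parent's flag stays False afterwards. That matches the three "first request" lines during training and the fourth one during evaluation.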
Upvotes: 1