Arne

Reputation: 20137

Python thread locking/class variable initialisation confusion

I have a class that behaves strangely when it is accessed by multiple threads. The threads are started during sklearn's GridSearchCV training (with n_jobs=3), so I don't know exactly how they are called.

My class itself looks roughly like this:

from sklearn.base import BaseEstimator, TransformerMixin
import threading

class FeatureExtractorBase(BaseEstimator, TransformerMixin):
    expensive_dependency = {}
    lock = threading.Lock()
    is_loaded = False

    @staticmethod
    def load_dependencies():
        # Hold the lock for the whole check-and-load so only one thread loads.
        with FeatureExtractorBase.lock:
            if not FeatureExtractorBase.is_loaded:
                print('first request, start loading..')
                # load dependencies, takes a while
                FeatureExtractorBase.is_loaded = True
                print('done.')

class ActualExtractor(FeatureExtractorBase):
    def transform(self, data):
        FeatureExtractorBase.load_dependencies()
        # generate features from data using dependencies
        return features

This class uses what I hoped would be lazy initialization. Initializing it right away led to problems, so I can't do that any more. And since the class gets reinitialized a couple of times during one program run, loading the data in the constructor each time would waste time. The problem is that this doesn't work the way I want it to. This is the output:

Starting training
Fitting 3 folds for each of 1 candidates, totalling 3 fits
first request, start loading..
first request, start loading..
first request, start loading..
done.
done.
done.
Done   1 jobs       | elapsed:   56.2s
Done   3 out of   3 | elapsed:  1.0min finished

Starting evaluation
first request, start loading..
done.

Not only do three threads enter what I thought would be a locked region at the same time; one minute later, during evaluation, the same region is entered again, even though is_loaded should have been set to True by that point.

This is my first time working with threads in Python, and I am still very clumsy with classes, so I am sure I am doing something wrong here. But I can't see what or where.

Upvotes: 2

Views: 191

Answers (1)

Gecko

Reputation: 1408

Because of the global interpreter lock (GIL), threading by itself does not speed up Python code whose bottleneck is the CPU rather than I/O (reading/writing). To actually get a speedup, sklearn uses multiprocessing for parallelization. This differs from threading in that objects are copied into separate processes, so there are effectively multiple copies of your class, each with its own lock and its own is_loaded flag. Each time a new process is started, your original, uninitialized class is copied, and that copy then does the loading again. Presumably the same applies to the evaluation step: it runs in the parent process, whose copy of the class was never loaded.
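To check this diagnosis yourself, here is a minimal sketch. LazyLoader, load_count and run_in_threads are hypothetical names standing in for your FeatureExtractorBase; they are not part of sklearn. It shows that with real threads inside a single process, the lock and the flag do what you expected and the load happens only once. Printing os.getpid() inside the locked region is a simple way to see which process does the loading: if you add the same print to your own class and see a different pid for each "first request" line, the workers are separate processes rather than threads.

```python
import os
import threading


class LazyLoader:
    """Hypothetical stand-in for FeatureExtractorBase (illustration only)."""
    lock = threading.Lock()
    is_loaded = False
    load_count = 0  # counts how often the expensive load actually ran

    @classmethod
    def load_dependencies(cls):
        # The lock covers both the check and the load.
        with cls.lock:
            if not cls.is_loaded:
                # The pid identifies the process doing the load.
                print(f"loading in process {os.getpid()}")
                cls.load_count += 1
                cls.is_loaded = True


def run_in_threads(n_threads=3):
    """Call load_dependencies from several threads of this one process."""
    threads = [
        threading.Thread(target=LazyLoader.load_dependencies)
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return LazyLoader.load_count


if __name__ == "__main__":
    print(run_in_threads())
```

Within one process, all threads share the same class object, so only the first caller loads. Class attributes are not shared between processes, which is why each worker process starts again from is_loaded = False.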

I think this is the main parallelization module sklearn uses for grid searching: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/externals/joblib/parallel.py

Upvotes: 1
