Pavel Komarov

Reputation: 1246

How can I use sklearn's GridSearchCV with data that doesn't fit in memory?

I have a dataset that is much too large to fit in memory, so I must train models on it in batches. Whether I wrap my model in a GridSearchCV, a RandomizedSearchCV, or a BayesSearchCV (from scikit-optimize), I can't simply train separate instances of the search on different parts of the enormous dataset and expect the best hyperparameters found by each to agree.

I have considered wrapping my estimator in a BatchVoter (of my own design) that manages reading from the database in batches and keeps a list of models. Passing this to the XSearchCV and updating the parameter-space dictionary so that every key begins with 'estimator__' would direct the search to set the parameters of the sub-estimator, but there is still a problem: the search is begun with a call to the .fit() method, which must take data.
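
For concreteness, here is a minimal sketch of that idea. Everything in it is hypothetical: BatchVoter is my own name, and batch_iter stands in for whatever reads (X, y) chunks out of the database.

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin, clone

    class BatchVoter(BaseEstimator, ClassifierMixin):
        """Hypothetical wrapper: fit one clone of `estimator` per batch,
        then predict by majority vote over the fitted clones."""

        def __init__(self, estimator=None, batch_iter=None):
            self.estimator = estimator
            self.batch_iter = batch_iter  # callable yielding (X, y) chunks

        def fit(self, X=None, y=None):
            # The data really comes from batch_iter, not from X and y --
            # which is exactly the mismatch described above, since the
            # search insists on calling .fit(X, y) with data.
            self.models_ = [clone(self.estimator).fit(Xb, yb)
                            for Xb, yb in self.batch_iter()]
            return self

        def predict(self, X):
            votes = np.stack([m.predict(X) for m in self.models_])
            # Majority vote, assuming non-negative integer class labels.
            return np.apply_along_axis(
                lambda col: np.bincount(col).argmax(), 0, votes)

    # The search would then target the sub-estimator via the prefix:
    # param_grid = {'estimator__alpha': [1e-4, 1e-3, 1e-2]}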

Is there a clever way to use the native GridSearchCV with data that is too big to pass to the .fit() method?

Upvotes: 2

Views: 1323

Answers (1)

wl2776

Reputation: 4327

Try dask. It supports DataFrames, arrays, and general collections, and it consists of a scheduler and workers. It also has a distributed scheduler, which allows processing DataFrames across several machines.
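
For example, a dask array is assembled from many smaller chunks and evaluated lazily, so it can describe data that never fits in memory all at once. A minimal sketch:

    import dask.array as da

    # A 'virtual' array of a million rows in 10,000-row chunks;
    # nothing is materialized in memory yet.
    x = da.random.random((1_000_000, 100), chunks=(10_000, 100))

    # Operations build a task graph; .compute() runs it chunk by chunk,
    # on the local scheduler or a distributed cluster of workers.
    col_means = x.mean(axis=0).compute()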

Here is a description of how to parallelize models.

Here is a link to a complete module that can serve as a drop-in replacement for GridSearchCV.
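
A sketch of how that might be used, assuming the module in question is what now ships as dask_ml.model_selection (formerly the separate dask-searchcv package), and that your version accepts dask arrays directly:

    import dask.array as da
    from dask_ml.model_selection import GridSearchCV  # drop-in for sklearn's
    from sklearn.linear_model import SGDClassifier

    # Chunked, larger-than-memory training data.
    X = da.random.random((1_000_000, 20), chunks=(50_000, 20))
    y = da.random.randint(0, 2, size=(1_000_000,), chunks=(50_000,))

    param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}
    search = GridSearchCV(SGDClassifier(), param_grid)
    search.fit(X, y)  # the dask scheduler executes the search graph
    print(search.best_params_)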

Upvotes: 1
