Reputation: 1246
I have a dataset that is much too large to fit in memory, so I must train models in batches. I have wrapped my model in a GridSearchCV, a RandomizedSearchCV, or a BayesSearchCV (from scikit-optimize), and I see that I cannot train multiple instances of these on different parts of my enormous dataset and expect the best hyperparameters found by each to agree.
I have considered wrapping my estimators in a BatchVoter (of my own design) that manages reading from the database in batches and keeps a list of models. Passing this to the XSearchCV and updating the parameter-space dictionary so all keys lead with 'estimator__' would direct the search to set the parameters of the sub-object, but there is still a problem: a search is begun with a call to the .fit() method, which must take data.
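A rough sketch of what I have in mind (BatchVoter, load_batches, and the SGDClassifier base estimator here are simplified placeholders, not my actual code):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV


def load_batches(n_batches=5, batch_size=256, n_features=20):
    """Stand-in for reading successive batches from the database."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = rng.integers(0, 2, size=batch_size)
        yield X, y


class BatchVoter(BaseEstimator, ClassifierMixin):
    """Wraps an estimator and fits it incrementally over database batches."""

    def __init__(self, estimator=None, n_batches=5):
        self.estimator = estimator
        self.n_batches = n_batches

    def fit(self, X, y=None):
        # The data passed in here is ignored; the real data is streamed
        # in batches, which is exactly the awkward part of this design.
        self.model_ = clone(self.estimator)
        for X_batch, y_batch in load_batches(self.n_batches):
            self.model_.partial_fit(X_batch, y_batch, classes=[0, 1])
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Keys lead with 'estimator__' so the search sets parameters on the sub-object.
param_grid = {"estimator__alpha": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(BatchVoter(SGDClassifier()), param_grid, cv=3)
# search.fit(...) still demands an in-memory X and y, which is the problem.
```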
Is there a clever way to use the native GridSearchCV with data that is too big to pass to the .fit() method?
Upvotes: 2
Views: 1323
Reputation: 4327
Try dask. It supports DataFrames, arrays, and general collections. It consists of a scheduler and workers, and it also has a distributed scheduler, which lets you process data frames across several PCs.
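A minimal sketch of that setup (the CSV pattern and column name are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Client() with no arguments starts a local scheduler and workers;
# pass the address of a remote scheduler to spread work over several PCs.
client = Client()

# The data is read lazily in partitions, so it never has to fit in memory at once.
df = dd.read_csv("data-*.csv")
print(df["target"].value_counts().compute())
```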
Here is the description of how to parallelize models.
Here is the link to a complete module that could serve as a drop-in replacement for GridSearchCV.
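Assuming that module is dask-ml's GridSearchCV (dask_ml.model_selection), a rough sketch of using it as the replacement looks like this (the toy data and SGDClassifier are only placeholders):

```python
import dask.array as da
import numpy as np
from dask_ml.model_selection import GridSearchCV  # same interface as sklearn's GridSearchCV
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Chunked dask arrays stand in for data too large to hold in memory at once.
X = da.from_array(rng.normal(size=(10_000, 20)), chunks=(1_000, 20))
y = da.from_array(rng.integers(0, 2, size=10_000), chunks=1_000)

search = GridSearchCV(SGDClassifier(), {"alpha": [1e-4, 1e-3, 1e-2]}, cv=3)
search.fit(X, y)  # the scheduler farms the cross-validation fits out to workers
print(search.best_params_)
```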
Upvotes: 1