Reputation: 409
In order to tune some machine learning's (or even pipeline's) hyperparameters, sklearn proposes the exhaustive "GridsearchCV" and the randomized "RandomizedSearchCV". The latter samples the provided distributions and test them out, to finally select the best model (and provide the result of each tentative).
But let's say I train 1'000 models using this randomized method. Later, I decide this isn't precise enough, and want to try 1'000 more models. Can I resume the training? Aka, asking to sample more, and try more models without losing current progress. Calling fit()
a second time "restarts" and discards previous hyperparameters combinations.
My situation looks like the following:
pipeline_cv = RandomizedSearchCV(pipeline, distribution, n_iter=1000, n_jobs=-1)
pipeline_cv = pipeline_cv.fit(trainX, trainy)
predictions = pipeline_cv.predict(targetX)
Then, later, I'd decide that 1000 iterations are not enough to cover my distributions' space, so I would do something like
pipeline_cv = pipeline_cv.resume(trainX, trainy, n_iter=1000) # doesn't exist?
And then I'd have a model trained across 2'000 hyperparameters combinations.
Is my goal achievable?
Upvotes: 1
Views: 506
Reputation: 60321
There is a Github issue on that back from Sep 2017, but it is still open:
In practice it is useful to search over some parameter space and then continue the search over some related space. We could provide a
warm_start
parameter to make it easy to accumulate results for further candidates intocv_results_
(without reevaluating parameter combinations that have already tested).
And a similar question in Cross Validated is also effectively unanswered.
So, the answer would seem to be no (plus that the scikit-learn community has not felt the need to include such a possibility).
But let's stop for a moment to think if something like that would be really valuable...
RandomizedSearchCV essentially works by random sampling parameter values from a given distribution; e.g., using the example from the docs:
distributions = dict(C=uniform(loc=0, scale=4),
penalty=['l2', 'l1'])
According to the very basic principles of such random sampling and random number generation (RNG) in general, there is not any guarantee that such a randomly sampled value will not be randomly sampled more than one time, especially if the number of iterations is large. Factor in the fact that RandomizedSearchCV does not do any bookkeeping itself either, hence in principle it can happen that same parameter combinations will be tried more than once in any single run (again, provided that the number of iterations is sufficiently large).
Even in cases of continuous distributions (like the uniform one used above), where the probability of getting exact values already sampled may be very small, there is the routine case of two samples being like 0.678918
and 0.678919
, which, however close, they are still different, and count as different trials.
Given the above, I cannot see how "warm starting" a RandomizedSearchCV will be of any practical use. The real value of RandomizedSearchCV lies at the possibility of sampling a usually large area of parameter values - so large that we consider useful to unleash the power of simple random sampling, which, let me repeat, does not itself "remember" past samples returned, and it may very well return samples that are (exactly or approximately) equal to what has been already returned in the past, thus rendering any "warm start" practically irrelevant.
So effectively, simply running two (or more) RandomizedSearchCV processes sequentially (and storing their results) does the job adequately, provided that we do not use the same random seed for different runs (i.e. what is effectively suggested in the Cross Validated thread mentioned above).
Upvotes: 1