Reputation: 5527
I would like to find the best parameters for a RandomForest classifier (with scikit-learn) so that it generalises well to other datasets (which may not be iid). I was thinking of doing a grid search on the whole training dataset while evaluating the scoring function on the other datasets. Is there an easy way to do this in Python/scikit-learn?
Upvotes: 0
Views: 1761
Reputation: 917
If you can, simply merge the two datasets and run GridSearchCV on the combined data. The parameters are then selected on samples from both datasets, which helps generalization to the other dataset. If you are talking about generalization to future, unknown datasets, then this might not work, because there is no perfect dataset from which we can train a perfect model.
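A minimal sketch of the merge-and-search idea, assuming both datasets share the same feature columns. The synthetic data from `make_classification`, the names `X_a`/`X_b`, and the small parameter grid are placeholders for illustration only, not anything from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-ins for the two datasets (same number of features).
X_a, y_a = make_classification(n_samples=200, n_features=10, random_state=0)
X_b, y_b = make_classification(n_samples=100, n_features=10, random_state=1)

# Merge both datasets into a single pool of samples.
X_all = np.vstack([X_a, X_b])
y_all = np.concatenate([y_a, y_b])

# Illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_all, y_all)

best_params = search.best_params_
print(best_params, search.best_score_)
```

Note that the folds here are drawn from the merged pool, so each validation fold will typically mix samples from both datasets rather than holding one dataset out entirely.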
Upvotes: 1
Reputation: 615
I don't think you can evaluate on a different dataset. The whole idea behind GridSearchCV is that it splits your training set into n folds, trains on n-1 of those folds and evaluates on the remaining one, repeating the procedure until every fold has been "the odd one out". This keeps you from having to set aside a specific validation set, so you can simply use a training set and a testing set.
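The fold rotation described above can be sketched with `KFold` on a toy array. This is only an illustration of the splitting scheme: for classifiers, GridSearchCV uses a stratified variant by default when `cv` is an integer, and the array `X` below is made up.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
kfold = KFold(n_splits=5)

held_out = []
for train_idx, val_idx in kfold.split(X):
    # Each round trains on n-1 folds and evaluates on the remaining fold.
    assert len(set(train_idx) & set(val_idx)) == 0
    held_out.extend(val_idx.tolist())
    print("train:", train_idx, "validate:", val_idx)
```

After the loop, `held_out` contains every sample index exactly once, which is the "every fold has been the odd one out" property.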
Upvotes: 2