Reputation: 37
I have a general question about cross-validation.
The notebook for module 2 says that one should use pipelines for cross-validation in order to prevent data leakage. I understand why, but I have a question about how to use the pipeline:
Suppose I want three steps in a pipeline: MinMaxScaler(), PolynomialFeatures (for multiple degrees) and finally a Ridge (for multiple alpha values). Since I want to find the best model across multiple parameter values, I will use GridSearchCV(), which performs cross-validation and reports the best model score.
However, after I initialise a pipeline object with the three steps and pass it to GridSearchCV(), how do I supply the multiple degrees and alpha values through the param_grid parameter of GridSearchCV()? Do I pass them as a list of two lists, in the order in which the steps are defined in the pipeline, or as a dictionary of two lists, where the keys are the names of the steps in the pipeline?
Upvotes: 2
Views: 1254
Reputation: 16966
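You just have to pass it as a dictionary. Each key is the step name, a double underscore, and the parameter name (for example ridge__alpha); make_pipeline names each step after its class name in lowercase.
Try this example: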
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
X, y = make_regression(random_state=42)
pipe = make_pipeline(MinMaxScaler(), PolynomialFeatures(), Ridge())
pipe
# Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
#                 ('polynomialfeatures', PolynomialFeatures()),
#                 ('ridge', Ridge())])
gs = GridSearchCV(pipe, param_grid={'polynomialfeatures__degree': [2, 4],
                                    'ridge__alpha': [1, 10]}).fit(X, y)
gs.best_params_
# {'polynomialfeatures__degree': 2, 'ridge__alpha': 1}
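If you would rather choose the step names yourself, a variant along the following lines should also work. This is only a sketch: the step names "scale", "poly" and "ridge", the small synthetic data set, and the particular degree and alpha values are assumptions made for illustration, not anything from the notebook. It also shows that pipe.get_params() lists every key that param_grid will accept.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Small synthetic data set so the search runs quickly (illustrative only).
X, y = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

# Explicit step names; "scale", "poly" and "ridge" are arbitrary labels.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("poly", PolynomialFeatures()),
    ("ridge", Ridge()),
])

# get_params() lists every settable parameter; nested parameters of a step
# appear as "<step name>__<parameter name>".
valid_keys = sorted(pipe.get_params().keys())

# One dictionary, one list of candidate values per key.
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X, y)
best_params = gs.best_params_

# Every combination is tried: 3 degrees x 3 alphas = 9 candidates.
n_candidates = len(gs.cv_results_["params"])
```

With the default refit=True, gs.best_estimator_ is the whole pipeline refitted on all of X and y with the best combination, so the scaler and the polynomial expansion are fitted only on the training part of each fold during the search.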
Upvotes: 1