Xavier Fournat

Reputation: 63

SKLEARN // Combine GridsearchCV with column transform and pipeline

I am struggling with a machine learning project in which I am trying to combine GridSearchCV with a column transformer and a pipeline.

As long as I fill in the parameters of my different transformers manually in my pipeline, the code works perfectly. But as soon as I try to pass lists of different values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.

Here is my code:

First, I divide my features into numerical and categorical:

import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)

Then I create 2 different preprocessing pipelines for numerical and categorical features:

numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))

I then combine both into another pipeline, set my parameters, and run GridSearchCV:

model=make_pipeline(preprocessor, LinearRegression() )

params={
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}

grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')

I tried different ways to declare the parameters, but never found the proper one. I always get an "invalid parameter" error message.

Could you please help me understand what went wrong?

Many thanks for your support, and take good care!

Upvotes: 4

Views: 2209

Answers (1)

Venkatachalam

Reputation: 16966

I am assuming that you defined preprocessor as follows:

preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
                        ('cat_pipeline', cat_pipeline)])

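If so, you need to change your parameter name. make_pipeline names each step after its lowercased class name, so the preprocessor (a Pipeline) becomes the step "pipeline", not "columntransformer":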

pipeline__numerical_pipeline__knnimputer__n_neighbors
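
If you are unsure which names are valid, you can ask the estimator itself: every key returned by get_params() can be used in param_grid. The snippet below is a small sketch that rebuilds the pipeline under the same assumption about preprocessor as above and lists the keys that end in n_neighbors.

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Assumed definition of preprocessor (not shown in the question)
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# get_params() returns the nested parameter names, joined with "__"
knn_params = sorted(name for name in model.get_params()
                    if name.endswith('n_neighbors'))
print(knn_params)
```

Printing all of model.get_params().keys() works too; filtering just keeps the output short.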

However, there are a couple of other problems with the code:

  1. You don't have to call cross_val_score after running GridSearchCV. GridSearchCV already stores the cross-validation results for each combination of hyperparameters (see its cv_results_ attribute after fitting).

  2. KNNImputer does not work when your data contains strings. You need to apply cat_pipeline before numerical_pipeline.
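
To illustrate the first point, here is a rough sketch of reading the scores directly from a fitted GridSearchCV. The data is synthetic (make_regression with some values blanked out) and purely an assumption for the sake of a runnable example; the pipeline is reduced to KNNImputer plus LinearRegression.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic regression data with roughly 10% of the entries set to NaN
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

model = make_pipeline(KNNImputer(), LinearRegression())
params = {'knnimputer__n_neighbors': [1, 2, 3]}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=5)
grid.fit(X, y)

# One mean cross-validated score per candidate value of n_neighbors
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    print(candidate, round(score, 3))
print('best:', grid.best_params_, round(grid.best_score_, 3))
```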

Complete example:

import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1, 0, 1, 1]

# Defined as in the question; the sequential Pipeline below does not use them
numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
# Dense output is needed because KNNImputer does not accept sparse input.
# sparse_output requires scikit-learn >= 1.2; older releases use sparse=False.
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore', sparse_output=False))

# Steps run one after another on the whole frame: encode first, then impute
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline', numerical_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'pipeline__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
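
Since the question defines column selectors and uses a columntransformer__ prefix, the preprocessor may instead have been meant as a ColumnTransformer, which sends each group of columns to its own pipeline. The following is only a sketch under that assumption: the toy data (including the price column) and the step names passed to ColumnTransformer are illustrative and not taken from the original post. With these names, the parameter key from the question becomes valid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and two numerical columns with gaps
X = pd.DataFrame({
    'city': pd.Series(['London', 'London', 'Paris', np.nan,
                       'Paris', 'London', 'Paris', 'London'], dtype=object),
    'rating': [5.0, 3.0, 4.0, np.nan, 2.0, 4.0, 5.0, 3.0],
    'price': [10.0, 12.0, np.nan, 9.0, 14.0, 11.0, 8.0, 13.0],
})
y = [1.0, 0.2, 0.9, 1.1, 0.1, 0.8, 1.2, 0.3]

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Each pipeline only sees its own columns, so KNNImputer never meets strings
preprocessor = ColumnTransformer([
    ('numerical_pipeline', numerical_pipeline, numerical_features),
    ('cat_pipeline', cat_pipeline, cat_features),
])

# make_pipeline names this step "columntransformer" (lowercased class name)
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

Note that make_column_transformer would name the two pipelines automatically rather than "numerical_pipeline" and "cat_pipeline", so it is worth checking model.get_params().keys() if you build the transformer that way.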

Upvotes: 2
