Use GridSearchCV to tune hyperparameters that change a pandas DataFrame

Question

I would like to use GridSearchCV to tune hyperparameters of user-defined estimators that operate on pandas DataFrames: for example, imputing the median, or choosing whether or not to pass a column to the estimator. Below I illustrate with just a column selector, but the idea is to be able to tune the parameters in more complex ways. I keep getting cryptic error messages that I cannot yet decipher, for example 'list' object has no attribute 'flags':

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

cal_house = fetch_california_housing()
data      = cal_house['data']
names     = cal_house['feature_names']

df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']

class ColumnSelector(BaseEstimator):
    def __init__(self, columns_for_x = ['MedInc','HouseAge']):

        self.columns     = columns_for_x
        #self.lags        = lags
        #self.grouper_col = grouper_col

    def fit(self, X, y):
        return self


    def transform(self, X, y):

        X = X.loc[:,self.columns].values

        return X, y


pipe       = Pipeline([('colselect', ColumnSelector()),
                        ('Ridge', Ridge())])

gridsearch  = GridSearchCV(cv=5, scoring='r2',
                          param_grid= {'colselect__columns_for_x':[['MedInc','HouseAge'],
                                                                   ['MedInc','Population','Latitude'],
                                                                   ['MedInc','AveRooms','AveOccup']],
                                       'Ridge__alpha':[0.001,0.01,0.1,1,10]}, estimator=pipe)

X = df.drop('houseval', axis = 1).values
y = df.loc[:,'houseval'].values
# gridsearch.fit(X=X,y=y)

Nablezen · Accepted Answer

I deviated somewhat from your original code -- mostly because the intricacies of using a custom estimator to achieve your desired column select transformation proved to be too much overhead for me.

This is my solution:

from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

cal_house = fetch_california_housing()
data = cal_house['data']
names = cal_house['feature_names']

df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']


def keep_columns(X, columns=("MedInc", "HouseAge")):
    column_indices = [
        names.index(name) for name in columns
    ]

    return X[:, column_indices]

pipe = Pipeline([
    ("colselect", FunctionTransformer(keep_columns)),
    ("Ridge", Ridge()),
])

gridsearch = GridSearchCV(
    cv=5, scoring='r2',
    param_grid={
        # kw_args is forwarded to func; inv_kw_args would only reach inverse_func
        'colselect__kw_args': [
            {"columns": columns}
            for columns in [
                ['MedInc', 'HouseAge'],
                ['MedInc', 'Population', 'Latitude'],
                ['MedInc', 'AveRooms', 'AveOccup']
            ]
        ],
        'Ridge__alpha': [0.001, 0.01, 0.1, 1, 10]
    },
    estimator=pipe
)

X = df.drop('houseval', axis=1).values
y = df.loc[:, 'houseval'].values
gridsearch.fit(X=X, y=y)
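
To see how a grid entry for the column selection reaches keep_columns, here is a minimal sketch on a toy array. It assumes a three-name stand-in for the feature list and made-up values; FunctionTransformer forwards kw_args to func (inv_kw_args goes to inverse_func instead), and GridSearchCV applies a grid key such as 'colselect__kw_args' through set_params on that step.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Stand-in for the real feature names (assumption for this sketch).
names = ["MedInc", "HouseAge", "Population"]


def keep_columns(X, columns=("MedInc", "HouseAge")):
    # Translate column names into positional indices for the NumPy array.
    column_indices = [names.index(name) for name in columns]
    return X[:, column_indices]


X = np.arange(6).reshape(2, 3)  # rows are [0, 1, 2] and [3, 4, 5]

selector = FunctionTransformer(keep_columns)
default_out = selector.fit_transform(X)  # uses the default columns

# Roughly what the grid search does for one candidate of 'colselect__kw_args'.
selector.set_params(kw_args={"columns": ["MedInc", "Population"]})
tuned_out = selector.fit_transform(X)
```

With the default columns the first two columns are kept; after set_params the first and third are kept, which is the behaviour the grid relies on.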

The main issues with your code:

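Using a custom class in the pipeline for this transformation is excessive, since it is basically just an array access. The transformation code in your class assumes a DataFrame, but because you pass df.drop(...).values to fit, the pipeline receives a numpy.ndarray. In a NumPy array, columns must be accessed by index, so my code computes these indices from your list of feature names, names.
A custom estimator has to follow scikit-learn's parameter conventions so that it can be cloned. GridSearchCV clones the estimator for every candidate via get_params and set_params, which expect each __init__ argument to be stored as an attribute of the same name. Your class stores columns_for_x as self.columns, so the parameter cannot be read back and, depending on the scikit-learn version, either appears to become None or raises an error. In addition, Pipeline calls transform(X) on intermediate steps and expects only the transformed X back, not an (X, y) tuple.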
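
If you do want a custom transformer that works on DataFrames, the following is a hedged sketch of how the class from the question could be adapted, not a definitive implementation. It assumes a small synthetic DataFrame in place of the housing data (the values and the target are made up), stores the __init__ argument under its own name, and passes the DataFrame itself to fit rather than .values so that column names survive the cross-validation splits.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select DataFrame columns by name (illustrative sketch)."""

    def __init__(self, columns_for_x=("MedInc", "HouseAge")):
        # Same attribute name as the argument, so get_params/clone can find it.
        self.columns_for_x = columns_for_x

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Pipeline passes only X here and expects only the transformed X back.
        return X.loc[:, list(self.columns_for_x)].to_numpy()


# Synthetic stand-in for the housing data; the target depends on MedInc only.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(60, 3)), columns=["MedInc", "HouseAge", "Population"]
)
y = 2.0 * df["MedInc"] + rng.normal(scale=0.1, size=60)

pipe = Pipeline([("colselect", ColumnSelector()), ("Ridge", Ridge())])
gridsearch = GridSearchCV(
    estimator=pipe,
    cv=3,
    scoring="r2",
    param_grid={
        "colselect__columns_for_x": [
            ["MedInc", "HouseAge"],
            ["HouseAge", "Population"],
        ],
        "Ridge__alpha": [0.1, 1.0],
    },
)
gridsearch.fit(df, y)  # a DataFrame, not df.values
best_columns = gridsearch.best_params_["colselect__columns_for_x"]
```

The same pattern should extend to other DataFrame-level choices such as imputation strategy, as long as every tunable option is an __init__ argument stored under its own name.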
