Reputation: 2314
I would like to use GridSearchCV to tune hyperparameters of user-defined estimators that operate on pandas DataFrames. For example, impute the median, choose whether or not to pass a column on to the estimators, and so on. Below I illustrate the idea with just a column selector, but the goal is to be able to tune the parameters in more complex ways. I keep getting cryptic messages that I cannot yet decipher, for example 'list' object has no attribute 'flags':
from sklearn.datasets import california_housing
from sklearn.linear_model import Ridge
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

cal_house = california_housing.fetch_california_housing()
data = cal_house['data']
names = cal_house['feature_names']
df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']

class ColumnSelector(BaseEstimator):
    def __init__(self, columns_for_x=['MedInc', 'HouseAge']):
        self.columns = columns_for_x
        #self.lags = lags
        #self.grouper_col = grouper_col

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        X = X.loc[:, self.columns].values
        return X, y

pipe = Pipeline([('colselect', ColumnSelector()),
                 ('Ridge', Ridge())])

gridsearch = GridSearchCV(cv=5, scoring='r2',
                          param_grid={'colselect__columns_for_x': [['MedInc', 'HouseAge'],
                                                                   ['MedInc', 'Population', 'Latitude'],
                                                                   ['MedInc', 'AveRooms', 'AveOccup']],
                                      'Ridge__alpha': [0.001, 0.01, 0.1, 1, 10]},
                          estimator=pipe)

X = df.drop('houseval', axis=1).values
y = df.loc[:, 'houseval'].values
# gridsearch.fit(X=X, y=y)
Upvotes: 0
Views: 889
Reputation: 372
I deviated somewhat from your original code -- mostly because the intricacies of using a custom estimator to achieve your desired column select transformation proved to be too much overhead for me.
This is my solution:
from sklearn.datasets import california_housing
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

cal_house = california_housing.fetch_california_housing()
data = cal_house['data']
names = cal_house['feature_names']
df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']

def keep_columns(X, columns=("MedInc", "HouseAge")):
    # X arrives as a numpy array, so map the feature names to positional indices
    column_indices = [
        names.index(name) for name in columns
    ]
    return X[:, column_indices]

pipe = Pipeline([
    ("colselect", FunctionTransformer(keep_columns)),
    ("Ridge", Ridge()),
])

gridsearch = GridSearchCV(
    cv=5, scoring='r2',
    param_grid={
        # kw_args is forwarded to keep_columns by FunctionTransformer
        'colselect__kw_args': [
            {"columns": columns}
            for columns in [
                ['MedInc', 'HouseAge'],
                ['MedInc', 'Population', 'Latitude'],
                ['MedInc', 'AveRooms', 'AveOccup']
            ]
        ],
        'Ridge__alpha': [0.001, 0.01, 0.1, 1, 10]
    },
    estimator=pipe
)

X = df.drop('houseval', axis=1).values
y = df.loc[:, 'houseval'].values
gridsearch.fit(X=X, y=y)
The main issues with your code:
- Using a custom class in the pipeline to achieve your transformation is excessive -- it is basically just an array access.
- Your transformation code in the class assumed a DataFrame, but the grid search passes numpy.array objects instead. In a numpy array, columns have to be accessed by index, so my code computes these indices from your feature names list names.
- A custom estimator also has to support sklearn's cloning mechanism, which must reproduce all constructor parameters; otherwise it appears as though they become None, since sklearn.model_selection.GridSearchCV clones estimators rather than deep-copying them (see the sketch after this list).
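For completeness, a minimal sketch of how the original custom class could be made clone-compatible instead (untested against your exact setup; it renames the constructor argument to columns so the attribute name matches, and it assumes X is passed to fit as a DataFrame rather than as .values, so column names are available):
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns=('MedInc', 'HouseAge')):
        # store the constructor argument under the same attribute name,
        # so the get_params()/clone() machinery used by GridSearchCV can round-trip it
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # a transformer's transform must return X only, not (X, y)
        return X.loc[:, list(self.columns)].values
With this version the grid key would be 'colselect__columns', and gridsearch.fit would be called with df.drop('houseval', axis=1) itself instead of its .values.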
Upvotes: 1