Reputation: 1
I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
self.degree_ = degree
self.poly_features_ = poly_features
def fit(self, X, y=None):
# Return the classifier
return self
def transform(self, X, y=None):
poly_feat = PolynomialFeatures(degree=self.degree_)
OneHot = OneHotEncoder(sparse=False)
not_poly_features = list(set(X.columns) - set(self.poly_features_))
poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
return poly
def get_params(self, deep=True):
return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
Upvotes: 0
Views: 1678
Reputation: 12582
To debug searches in general, set error_score='raise'
, so that you get a full error traceback.
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, and so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix there is also as Sanjar says: instantiate, store as attributes, and fit the two transformers and in your fit
method, and use their transform
methods in your transform
method.
You will find there is another big issue: all the scores in cv_results_
are the same. This is because you can't actually set the hyperparameters correctly, because in __init__
you've used mismatching names (degree
as the parameter but degree_
as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params
similar to how you edited get_params
, but it would be much easier to actually rely on the BaseEstimator
versions of those and just match the parameter names to the attribute names.)
Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features
in __init__
.
class custom_poly_features(TransformerMixin, BaseEstimator):
def __init__(self, degree=2, poly_features=['year', 'odometer']):
self.degree = degree
self.poly_features = poly_features
def fit(self, X, y=None):
self.poly_feat = PolynomialFeatures(degree=self.degree)
self.onehot = OneHotEncoder(sparse=False)
self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
self.poly_feat.fit(X[self.poly_features])
self.onehot.fit(X[self.not_poly_features_])
return self
def transform(self, X, y=None):
poly = self.poly_feat.transform(X[self.poly_features])
poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])
return poly
There are some additional things you might want to add, like checks for whether poly_features
or not_poly_features_
is empty (which would break the corresponding transformer).
Finally, your custom estimator is just doing what a ColumnTransformer
is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer
.
custom_poly = ColumnTransformer(
transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}
Upvotes: 1