Davis Stöwer

Reputation: 1

Scikit learn GridSearchCV with pipeline with custom transformer

I'm trying to run GridSearchCV on a pipeline that contains a custom transformer. The transformer expands the features "year" and "odometer" into polynomial terms and one-hot encodes the remaining features. The ML model is a simple linear regression.

custom transformer code:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder 
from sklearn.preprocessing import PolynomialFeatures

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features       
    def fit(self, X, y=None):
        # Nothing is learned here; return the transformer itself
        return self
    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)

        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])

        return poly
    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}

pipeline & gridsearch code:

#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

poly_pipeline =  Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])

#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}

search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)

The custom transformer itself works fine, and so does the pipeline (the score is not great, but that is not the topic here).

poly_pipeline.fit(X_train, y_train).score(X_test, y_test)

Output:
0.543546844381771

However, when I perform the gridsearch, the scores are all nan values:

search.cv_results_

Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
 'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
 'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
 'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
 'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
          mask=[False, False, False],
    fill_value='?',
         dtype=object),
 'params': [{'custom_poly_features__degree': 3},
  {'custom_poly_features__degree': 4},
  {'custom_poly_features__degree': 5}],
 'split0_test_score': array([nan, nan, nan]),
 'split1_test_score': array([nan, nan, nan]),
 'split2_test_score': array([nan, nan, nan]),
 'mean_test_score': array([nan, nan, nan]),
 'std_test_score': array([nan, nan, nan]),
 'rank_test_score': array([1, 2, 3])}

Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.

Upvotes: 0

Views: 1678

Answers (1)

Ben Reiniger

Reputation: 12582

To debug searches in general, set error_score='raise', so that you get a full error traceback.
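As a minimal sketch of what that setting does, the snippet below builds a search around a deliberately broken transformer. FailingTransformer and the toy data are hypothetical and exist only to trigger an error; they are not taken from the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class FailingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that always fails, to simulate a hidden bug."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        raise RuntimeError("simulated failure inside transform")


# Toy regression data, just enough for a 2-fold search.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

pipe = Pipeline([("fail", FailingTransformer()), ("lin_reg", LinearRegression())])
search = GridSearchCV(
    pipe,
    {"fail__scale": [1.0, 2.0]},
    cv=2,
    error_score="raise",  # re-raise the first fit error instead of recording nan
)

try:
    search.fit(X, y)
    caught = None
except RuntimeError as exc:
    # The original exception (with its traceback) reaches the caller.
    caught = str(exc)

print(caught)
```

With the default error_score, the failure is reduced to a warning and a nan score, which is the symptom in the question.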

Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but identifies the actual cause for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, so the one-hot encodings end up with different numbers of columns, and the linear model justifiably breaks.

So the fix there is also as Sanjar says: in your fit method, instantiate the two transformers, store them as attributes, and fit them; then call their transform methods in your transform method.
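The width mismatch is easy to reproduce in isolation. The sketch below uses a made-up "fuel" column (an assumption, not from the question's data) and compares refitting the encoder on each frame with fitting it once and reusing it. handle_unknown="ignore" is an extra choice made here so that categories unseen during fit do not raise.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test folds where the test fold lacks one category.
train = pd.DataFrame({"fuel": ["gas", "diesel", "electric", "gas"]})
test = pd.DataFrame({"fuel": ["gas", "diesel"]})

# Refitting on whatever data arrives (what fit_transform inside transform() does):
width_train = OneHotEncoder().fit_transform(train).shape[1]
width_test = OneHotEncoder().fit_transform(test).shape[1]

# Fitting once on the training fold and reusing the fitted encoder:
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
consistent_train = encoder.transform(train).shape[1]
consistent_test = encoder.transform(test).shape[1]

print(width_train, width_test)            # widths differ between folds
print(consistent_train, consistent_test)  # widths agree
```

The linear model is fitted on the training width, so it can only score test data that is encoded to the same number of columns.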

You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
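The effect of the name mismatch can be seen without any pipeline. The two toy classes below (Mismatched and Matched, both hypothetical) differ only in how __init__ stores degree; this is a sketch of the mechanism, not the question's full transformer.

```python
from sklearn.base import BaseEstimator


class Mismatched(BaseEstimator):
    """Mirrors the question: parameter `degree`, attribute `degree_`."""

    def __init__(self, degree=2):
        self.degree_ = degree

    def get_params(self, deep=True):
        return {"degree": self.degree_}


class Matched(BaseEstimator):
    """Attribute name equals parameter name; BaseEstimator does the rest."""

    def __init__(self, degree=2):
        self.degree = degree


bad = Mismatched().set_params(degree=5)
good = Matched().set_params(degree=5)

# set_params assigns to `bad.degree`, but the working code reads `bad.degree_`,
# so the value requested by the grid search is silently ignored.
print(bad.degree_, bad.get_params())
print(good.degree, good.get_params())
```

GridSearchCV goes through set_params for every candidate, so with the mismatched version each candidate is effectively fitted with the default degree.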

Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
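To illustrate the surprise, here is a small plain-Python sketch with two hypothetical classes: one uses a list default as in the question, the other a tuple, which is one possible alternative.

```python
class SharedDefault:
    """List default: every instance created without arguments shares one list."""

    def __init__(self, poly_features=["year", "odometer"]):
        self.poly_features = poly_features


class TupleDefault:
    """Tuple default: immutable, so it cannot be changed by accident."""

    def __init__(self, poly_features=("year", "odometer")):
        self.poly_features = poly_features


a = SharedDefault()
b = SharedDefault()
a.poly_features.append("price")  # mutates the single shared default list
shared_after = b.poly_features   # b now sees "price" as well

c = TupleDefault()
print(shared_after)
print(c.poly_features)
```

If you switch to a tuple, index the DataFrame with X[list(self.poly_features)], because pandas treats a tuple as a single column label. Another common option is a default of None that is resolved inside fit rather than in __init__.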

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that on newer versions.
        self.onehot = OneHotEncoder(sparse=False)

        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))

        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])

        return self

    def transform(self, X, y=None):
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly

There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
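One possible shape for such checks is a small helper that fit could call before fitting the two transformers. split_columns is a hypothetical name, and the exact error policy is an assumption; adapt it to your needs.

```python
def split_columns(columns, poly_features):
    """Hypothetical helper: split columns into (poly, rest), failing early if either is empty."""
    columns = list(columns)
    wanted = list(poly_features)

    missing = [c for c in wanted if c not in columns]
    if missing:
        raise ValueError(f"poly_features not found in X: {missing}")

    # Keep the original column order instead of relying on set ordering.
    poly = [c for c in columns if c in wanted]
    rest = [c for c in columns if c not in wanted]

    if not poly:
        raise ValueError("poly_features is empty; nothing to expand polynomially")
    if not rest:
        raise ValueError("no columns left to one-hot encode")
    return poly, rest


poly_cols, other_cols = split_columns(["year", "fuel", "odometer"], ["year", "odometer"])
print(poly_cols, other_cols)
```

Inside fit, the call would look like split_columns(X.columns, self.poly_features), with the second return value stored as self.not_poly_features_.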


Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.

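from sklearn.compose import ColumnTransformer

# Polynomial terms for the two numeric columns; every other column is one-hot encoded.
custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)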

param_grid = {"cpf__poly__degree": [3, 4, 5]}
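Putting the pieces together, an end-to-end sketch could look like the following. The synthetic data frame (including the "fuel" column), the degree grid, and handle_unknown="ignore" are assumptions made for illustration; the step names "cpf" and "lin_reg" follow the question's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the car data (hypothetical values).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=n).astype(float),
    "odometer": rng.uniform(10.0, 200.0, size=n),  # thousands of km
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
})
y = 500.0 * (X["year"] - 2000) - 20.0 * X["odometer"] + rng.normal(0, 100, size=n)

custom_poly = ColumnTransformer(
    transformers=[("poly", PolynomialFeatures(), ["year", "odometer"])],
    # Assumption: ignore categories that appear only in a validation fold.
    remainder=OneHotEncoder(handle_unknown="ignore"),
)
pipe = Pipeline([("cpf", custom_poly), ("lin_reg", LinearRegression())])

# "cpf" selects the ColumnTransformer step, "poly" the transformer inside it.
param_grid = {"cpf__poly__degree": [1, 2, 3]}
search = GridSearchCV(pipe, param_grid, cv=3, error_score="raise")
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

Because each transformer is fitted on the training fold and only applied to the validation fold, the feature count stays consistent across folds.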

Upvotes: 1
