Geo M

Reputation: 39

make_pipeline with StandardScaler and KerasRegressor

I'm trying to GridSearchCV epochs and batch_size with the following code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

X_train2 = X_train.values.reshape((X_train.shape[0], 1, X_train.shape[1]))
y_train2 = np.ravel(y_train.values)

X_test2 = X_test.values.reshape((X_test.shape[0], 1, X_test.shape[1]))
y_test2 = np.ravel(y_test.values)

def build_model():
    model = Sequential()
    model.add(LSTM(500, input_shape=(1, X_train.shape[1])))
    model.add(Dense(1))
    model.compile(loss="mse", optimizer="adam")
    return model


new_model = KerasRegressor(build_fn=build_model, verbose=0)

pipe = Pipeline([('s', StandardScaler()), ('reg', new_model)])
param_gridd = {'reg__epochs': [5, 6], 'reg__batch_size': [71, 72]}
model = GridSearchCV(estimator=pipe, param_grid=param_gridd)

# ------------------ if the following two lines are used instead, the code works -> problem with Pipeline?
# param_gridd = {'epochs':[5,6], 'batch_size': [71, 72]}
# model = GridSearchCV(estimator=new_model, param_grid=param_gridd)


fitted = model.fit(X_train2, y_train2, validation_data=(X_test2, y_test2), verbose=2, shuffle=False)

and get the following error:

File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
 self._run_search(evaluate_candidates)   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
 evaluate_candidates(ParameterGrid(self.param_grid))   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
 cv.split(X, y, groups)))   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
 if self.dispatch_one_batch(iterator):   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
 self._dispatch(tasks)   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
 job = self._backend.apply_async(batch, callback=cb)   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/oblib/_parallel_backends.py", line 182, in apply_async
 result = ImmediateResult(func)   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
 self.results = batch()   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
 for func, args, kwargs in self.items]   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
 for func, args, kwargs in self.items]   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
 estimator.fit(X_train, y_train, **fit_params)   
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
 Xt, fit_params = self._fit(X, y, **fit_params)    
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 202, in _fit
 step, param = pname.split('__', 1)

ValueError: not enough values to unpack (expected 2, got 1)

I suspect that this has something to do with the naming in param_gridd, but I'm not really sure what is going on. Note that the code works fine when I take the pipeline out and run GridSearchCV directly on new_model.
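From the traceback, the failing line splits every keyword argument passed down to Pipeline.fit() on '__'. An unprefixed name reproduces the error in isolation (illustrative snippet, not part of my script):

step, param = 'validation_data'.split('__', 1)
# ValueError: not enough values to unpack (expected 2, got 1)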

Upvotes: 3

Views: 1206

Answers (2)

Venkatachalam

Reputation: 16966

I think the problem is with the way the fit parameters for KerasRegressor were fed. validation_data and shuffle are not parameters of GridSearchCV but of the reg step of the pipeline, so they need the reg__ prefix. Try this!

fitted = model.fit(X_train2, y_train2, **{'reg__validation_data': (X_test2, y_test2), 'reg__verbose': 2, 'reg__shuffle': False})

EDIT: Based on the findings of @Vivek Kumar, I have written a wrapper for your preprocessing.

from sklearn.preprocessing import StandardScaler

class custom_StandardScaler():
    """Scales 2-D data, then adds the timestep axis the LSTM expects."""
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        X_new = self.scaler.transform(X)
        # reshape (n_samples, n_features) -> (n_samples, 1, n_features)
        X_new = X_new.reshape((X.shape[0], 1, X.shape[1]))
        return X_new

This lets you apply the standard scaler and add the new dimension in one step. Remember that we have to transform the evaluation dataset before feeding it in via the fit params, hence a separate scaler (offline_scaler) is used to transform it.
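For instance, a quick shape check (illustrative values only):

import numpy as np

X_demo = np.random.rand(8, 13)           # 8 samples, 13 features
scaler = custom_StandardScaler().fit(X_demo)
print(scaler.transform(X_demo).shape)    # (8, 1, 13), ready for the LSTM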

from sklearn.datasets import load_boston
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from keras.layers import LSTM
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np

seed = 1

boston = load_boston()
X, y = boston['data'], boston['target']

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)


def build_model():
    model = Sequential()
    model.add(LSTM(5, input_shape=(1, X_train.shape[1])))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='Adam', metrics=['mae'])
    return model


new_model = KerasRegressor(build_fn=build_model, verbose=0)

param_gridd = {'reg__epochs':[2,3], 'reg__batch_size':[16,32]}
pipe = Pipeline([('s', custom_StandardScaler()),('reg', new_model)])

offline_scaler = custom_StandardScaler()
offline_scaler.fit(X_train)
X_eval2 = offline_scaler.transform(X_eval)

model = GridSearchCV(estimator=pipe, param_grid=param_gridd,cv=3)
fitted = model.fit(X_train, y_train,**{'reg__validation_data':(X_eval2, y_eval),'reg__verbose':2, 'reg__shuffle':False} )
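Once the search finishes, the usual GridSearchCV attributes are available (the winning values below are just an example; they depend on the run):

print(fitted.best_params_)  # e.g. {'reg__batch_size': 16, 'reg__epochs': 2}
print(fitted.best_score_)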

Upvotes: 2

Vivek Kumar

Reputation: 36619

As @AI_Learning said, this line should work:

fitted = model.fit(X_train2, y_train2, 
                   reg__validation_data=(X_test2, y_test2), 
                   reg__verbose=2, reg__shuffle=False)

Pipeline requires parameters to be named as "component__parameter", so prepending reg__ to the parameter names works.
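As a minimal sketch of that convention, using a plain LinearRegression as a stand-in for the KerasRegressor:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('s', StandardScaler()), ('reg', LinearRegression())])

# 'reg__fit_intercept' is split on '__' into ('reg', 'fit_intercept')
# and routed to the step named 'reg'
pipe.set_params(reg__fit_intercept=False)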

The fit() call above, however, still won't work, because StandardScaler will complain about the data dimensions. You see, when you did:

X_train2 = X_train.values.reshape((X_train.shape[0], 1, X_train.shape[1]))
...

X_test2 = X_test.values.reshape((X_test.shape[0], 1, X_test.shape[1]))

You made X_train2 and X_test2 3-D. You did this so the data fits the LSTM's expected input, but it won't work with StandardScaler, which requires 2-D data of shape (n_samples, n_features).
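A two-line check reproduces the complaint (the exact message may vary across sklearn versions):

import numpy as np
from sklearn.preprocessing import StandardScaler

X3d = np.zeros((10, 1, 4))  # (samples, timesteps, features)
StandardScaler().fit(X3d)   # ValueError: Found array with dim 3 ...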

If you remove the StandardScaler from your pipe like this:

pipe = Pipeline([('reg', new_model)])

And try the code @AI_Learning and I suggested, it will work. This shows that the problem has nothing to do with the pipeline, but with combining incompatible transformers.

You can take the StandardScaler out of the pipeline and do this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

# fit_transform() and transform() return NumPy arrays,
# so .values is no longer needed (and would fail) here
X_train2 = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
y_train2 = np.ravel(y_train.values)

...
...
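With the scaling done outside the pipeline, the grid keys from the question's commented-out lines work unprefixed again:

param_gridd = {'epochs': [5, 6], 'batch_size': [71, 72]}
model = GridSearchCV(estimator=new_model, param_grid=param_gridd)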

Upvotes: 1
