thePurplePython

Reputation: 2767

Integrate Keras to SKLearn Pipeline?

I have a sklearn pipeline performing feature engineering on heterogeneous data types (boolean, categorical, numeric, text) and wanted to try a neural network as my learning algorithm to fit the model. I am running into some problems with the shape of the input data.

I am wondering if what I am trying to do is even possible, or if I should try a different approach?

I have tried a couple of different methods but am receiving these errors:

  1. Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features ... so I then tried converting my X and y to arrays, and now I get this error:

  2. ValueError: Specifying the columns using strings is only supported for pandas DataFrames => which I think is because converting X to a NumPy array drops the column names that the ColumnTransformer() refers to

print(X_train_OS.shape)
print(y_train_OS.shape)

(22354, 11)
(22354,)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical # OHE

X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])

y_test_predictors = test_set.drop("label", axis=1)
y_test_predictors = y_test_predictors.values
y_test_target = to_categorical(test_set["label"])

print(X_train_predictors.shape)
print(y_train_target.shape)

(22354, 11)
(22354, 2)
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf

TOKENS_ALPHANUMERIC_HYPHEN = r"[A-Za-z0-9\-]+(?=\s+)"

boolTransformer = Pipeline(steps=[
    ('bool', PandasDataFrameSelector(BOOL_FEATURES))])

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

numTransformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])

textTransformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])

FE = ColumnTransformer(
    transformers=[
        ('bool', boolTransformer, BOOL_FEATURES),
        ('cat', catTransformer, CAT_FEATURES),
        ('num', numTransformer, NUM_FEATURES),
        ('text0', textTransformer_0, TEXT_FEATURES[0]),
        ('text1', textTransformer_1, TEXT_FEATURES[1])])

clf = KerasClassifier(build_fn=keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)

PL = Pipeline(steps=[('feature_engineer', FE),
                     ('keras_clf', clf)])

PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)

I think I understand the problem here, but I am not sure how to solve it. If it is not possible to integrate a sklearn ColumnTransformer+Pipeline into a Keras model, does Keras have a good way of dealing with mixed data types for feature engineering? Thank you!

Upvotes: 6

Views: 3194

Answers (2)

sslloo

Reputation: 521

I think using sklearn Pipelines and the Keras scikit-learn wrappers is a standard way of dealing with your problem, and ColumnTransformer allows you to manage each feature differently (whether it is boolean, numerical, or categorical).

To debug your code, I would suggest unit testing each of the steps of your Pipeline, especially textTransformer_0 and textTransformer_1.

For instance (note that a CountVectorizer expects a 1D iterable of strings, so pass it the raw text column rather than the full predictors array):

textTransformer_0.fit_transform(df_train_OS[TEXT_FEATURES[0]]).shape  # check shape[1]
textTransformer_1.fit_transform(df_train_OS[TEXT_FEATURES[1]]).shape  # check shape[1]

Do the same for the one-hot encoder, to understand what your final feature dimension will be.

Standard sklearn Pipelines deal with 2D np.ndarrays, so CountVectorizer will create a large number of columns (how many depends on your data), and this total transformed width is what must be passed as input_dim to your first keras Dense layer.
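
For example, here is a minimal sketch (reusing the FE ColumnTransformer and df_train_OS from your question) that fits the feature-engineering step on its own, reads off the transformed width, and uses it as input_dim:

X_raw = df_train_OS.drop("label", axis=1)  # keep it a DataFrame so ColumnTransformer can resolve column names
n_features = FE.fit_transform(X_raw).shape[1]  # also works if the output is a sparse matrix
print(n_features)  # e.g. 30513 in your case

def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=n_features, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return clf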

Upvotes: 1

Matt

Reputation: 982

It looks like you are passing your 11 columns of original data through your various column transformers, and the number of dimensions is expanding to 30,513 (after count-vectorizing your text, one-hot encoding, etc.). Your neural network architecture is set up to accept only 11 input features but is being passed the (now transformed) 30,513 features, which is what error 1 is explaining.

You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.

One thing you could do is add an intermediate step between them, such as SelectKBest, and set k to something like 20,000, so that you know exactly how many features will eventually be passed to the classifier.

There is a good guide and flowchart on the Google machine learning website - link - in the flowchart you can see they have a 'select top k features' step in the pipeline before training a model.

So, try updating these parts of your code to:

def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf

and

from sklearn.feature_selection import SelectKBest
select_best_features = SelectKBest(k=20000)

PL = Pipeline(steps=[('feature_engineer', FE),
                     ('select_k_best', select_best_features),
                     ('keras_clf', clf)])
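
When you then fit the pipeline, pass the raw DataFrame rather than a NumPy array (ColumnTransformer needs the string column names, which is what error 2 is about), and pass 1D integer labels (SelectKBest's default f_classif scorer expects a 1D y, and the Keras wrapper one-hot encodes the labels itself when the loss is categorical_crossentropy). A minimal sketch, reusing df_train_OS from your question:

PL.fit(df_train_OS.drop("label", axis=1), df_train_OS["label"].values)

If Keras complains about sparse input coming out of the transformers, you may also need a densifying step (e.g. a FunctionTransformer calling .toarray()) between select_k_best and keras_clf.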

Upvotes: 5
