How to include label encoding in a scikit-learn pipeline?

Question

I want to create a pipeline that includes all the necessary preprocessing steps. In my case that's imputation, encoding for both X and y, scaling, feature selection, and an estimator.

I have written the following code, but it raises ValueError: too many values to unpack (expected 2).

# Creating a scikit learn pipeline for preprocessing

## Selecting categorical and numeric features
numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

## Create preprocessing pipelines for each datatype 
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder()),
    ('scaler', StandardScaler())])

## Putting the preprocessing steps together
preprocessor = ColumnTransformer([
        ('numerical', numerical_transformer, numerical_ix),
        ('categorical', categorical_transformer, categorical_ix)],
         remainder='passthrough')


## Create example pipeline with kNN as estimator
example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('label', LabelEncoder(), y),
    ('selector', SelectKBest(k=len(X.columns))), # keep the same amount of columns for now
    ('classifier', KNeighborsClassifier())
])

## Test pipeline
example_pipe.fit_transform(X_train, y_train)
example_pipe.score(X_test, y_test)

I've read a bit about this topic and it seems that LabelEncoder() accepts only one argument. That's why I tried to specify that it should only process y when creating example_pipe.

Is there a way to include the label encoding in the pipeline, and if so, how? Or does it have to be done beforehand (e.g. with pandas)?

Ben Reiniger · Accepted Answer

You don't need to label-encode y; sklearn classifiers (including your KNeighborsClassifier) do that internally for you.
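As a quick check of that claim, here is a minimal sketch that fits KNeighborsClassifier directly on string labels. The toy data and the names X_demo, y_demo and clf are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters with string labels.
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_demo = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_demo, y_demo)  # no LabelEncoder involved

# The classifier stores the original labels and predicts in terms of them.
print(clf.classes_)
print(clf.predict([[0.05], [5.05]]))
```

The predictions come back as the original strings, so there is nothing to decode afterwards.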

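You cannot transform y in a Pipeline: its transformers only modify X, and y is handed to each step unchanged. (You could add y as a column of X, but then you would need to separate it manually prior to fitting your actual model.) You also cannot specify columns to apply transformers to in a Pipeline: each step must be a (name, estimator) pair, so the three-element step ('label', LabelEncoder(), y) is what triggers the "too many values to unpack (expected 2)" error. For column selection, see ColumnTransformer (which still won't work to transform y).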

This isn't related to your question, but future searchers might stumble across this when trying to ordinally encode their independent variables. For that, don't use LabelEncoder; use OrdinalEncoder.
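To illustrate the difference, a small sketch with made-up arrays (X_cat and y_raw are hypothetical): OrdinalEncoder works on a 2D array of features, one column at a time, while LabelEncoder works on a 1D target. If you do want an integer-encoded y for some reason, this is also how it could be done before the pipeline rather than inside it.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Features: 2D, shape (n_samples, n_features). Each column gets its own mapping.
X_cat = np.array([["red", "S"], ["blue", "M"], ["red", "L"]])
feature_encoder = OrdinalEncoder()
X_encoded = feature_encoder.fit_transform(X_cat)
print(feature_encoder.categories_)  # sorted categories per column
print(X_encoded)

# Target: 1D, shape (n_samples,). Encoded outside of any Pipeline.
y_raw = np.array(["yes", "no", "yes"])
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_raw)
print(label_encoder.classes_)
print(y_encoded)

# The mapping can be reversed to recover the original labels.
print(label_encoder.inverse_transform(y_encoded))
```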
