Reputation: 415
I want to create a pipeline with all the necessary (preprocessing) steps included. In my case that's imputation, encoding for both X and y, scaling, feature selection, and an estimator.
I have written the following code, but it gives me an error: ValueError: too many values to unpack (expected 2).
# Creating a scikit learn pipeline for preprocessing
## Selecting categorical and numeric features
numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns
## Create preprocessing pipelines for each datatype
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder()),
    ('scaler', StandardScaler())])
## Putting the preprocessing steps together
preprocessor = ColumnTransformer([
    ('numerical', numerical_transformer, numerical_ix),
    ('categorical', categorical_transformer, categorical_ix)],
    remainder='passthrough')
## Create example pipeline with kNN as estimator
example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('label', LabelEncoder(), y),
    ('selector', SelectKBest(k=len(X.columns))),  # keep the same amount of columns for now
    ('classifier', KNeighborsClassifier())
])
## Test pipeline
example_pipe.fit_transform(X_train, y_train)
example_pipe.score(X_test, y_test)
I've read a bit about this topic and it seems that LabelEncoder() only receives 1 argument. That's why I tried to specify that it should only process y when creating the example_pipe.
Is there a way to include the label encoding in the pipeline or does it have to be done beforehand (e.g. with pandas)? How, if possible, can I include the label encoding in the pipeline?
Upvotes: 1
Views: 3530
Reputation: 12748
You don't need to label-encode; sklearn classifiers (your KNeighborsClassifier) will do that internally for you.
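As a quick check of that claim, here is a minimal sketch that fits the classifier directly on string labels. The toy arrays X_demo and y_demo are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): two well-separated one-dimensional clusters.
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_demo = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])  # raw string labels

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_demo, y_demo)  # no LabelEncoder step beforehand

# The fitted classifier remembers the original labels...
print(clf.classes_)
# ...and predictions come back as those same strings, not integers.
predictions = clf.predict([[0.05], [4.9]])
print(predictions)
```

Because the classifier maps labels to integers and back internally, the predictions are already in the original label space and there is nothing to inverse-transform afterwards.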
You cannot transform y in a Pipeline (unless you add it as a column of X, in which case you would need to separate it manually prior to fitting your actual model). You also cannot specify columns to apply transformers to in a Pipeline; for that, see ColumnTransformer (which still won't work to transform y).
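Putting this together: each Pipeline step must be a (name, estimator) pair, so the three-element ('label', LabelEncoder(), y) step is most likely what triggers "too many values to unpack (expected 2)". Below is a sketch of the question's pipeline with that step dropped. The small DataFrame and labels are hypothetical stand-ins for the asker's data, k='all' replaces k=len(X.columns) as an equivalent "keep everything" setting, and fit is called instead of fit_transform because the final step is a classifier and has no transform method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the asker's X and y.
X = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 47.0, 52.0, 58.0],
    "income": [20.0, 24.0, 22.0, 61.0, 65.0, np.nan],
    "city": ["A", "B", "A", "C", "C", "B"],
})
y = pd.Series(["no", "no", "no", "yes", "yes", "yes"])  # string labels, left as-is

numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("encoder", OrdinalEncoder()),
    ("scaler", StandardScaler()),
])

# Column selection belongs in ColumnTransformer, not in Pipeline.
preprocessor = ColumnTransformer(
    [
        ("numerical", numerical_transformer, numerical_ix),
        ("categorical", categorical_transformer, categorical_ix),
    ],
    remainder="passthrough",
)

# Every step is a 2-tuple; there is no step for y at all.
example_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(k="all")),  # keep all columns for now
    ("classifier", KNeighborsClassifier(n_neighbors=3)),
])

example_pipe.fit(X, y)  # fit, not fit_transform
predicted = example_pipe.predict(X)
accuracy = example_pipe.score(X, y)
print(predicted, accuracy)
```

With real data you would fit on X_train, y_train and score on X_test, y_test as in the question; the toy frame here is too small to split meaningfully.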
This isn't related to your question, but future searchers might stumble across this when trying to ordinally-encode their independent variables. For that, don't use LabelEncoder, use OrdinalEncoder.
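To illustrate the difference, here is a small sketch with made-up values: OrdinalEncoder takes a 2-D feature array and encodes each column, while LabelEncoder takes a single 1-D target. The explicit category order passed via the categories parameter is an assumption chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up feature matrix with two categorical columns.
X_cat = np.array([["low", "red"], ["high", "blue"], ["medium", "red"]])

# OrdinalEncoder encodes each column of a 2-D array; by default the
# categories of each column are sorted, so "high" < "low" < "medium".
X_encoded = OrdinalEncoder().fit_transform(X_cat)
print(X_encoded)

# If the order matters, pass it explicitly, one list per column.
ordered = OrdinalEncoder(categories=[["low", "medium", "high"], ["blue", "red"]])
X_ordered = ordered.fit_transform(X_cat)
print(X_ordered)

# LabelEncoder is meant for a 1-D target, not for feature columns.
y_encoded = LabelEncoder().fit_transform(["no", "yes", "no"])
print(y_encoded)
```

Since OrdinalEncoder follows the usual transformer interface for X, it can sit inside a Pipeline or ColumnTransformer, which is how the question already uses it.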
Upvotes: 4