Filip Szczybura

Reputation: 437

How can I create a pipeline with encoding for different categorical columns?

I have a problem implementing a pipeline where I want to apply OrdinalEncoder and OneHotEncoder to different categorical columns.

At this point my code is as follows:

X = stroke_df.drop(columns=['id', 'smoking_status', 'stroke'])
y = stroke_df['stroke'].copy()

num_columns = X.select_dtypes(np.number).columns.tolist()
cat_columns = X.select_dtypes('object').columns.tolist()
all_columns = num_columns + cat_columns  # this order will need to be preserved
print('Numerical columns:', ', '.join(num_columns))
print('Categorical columns:', ', '.join(cat_columns))

num_pipeline = Pipeline([
  ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
  ('scaler', StandardScaler())
])

cat_pipeline = ColumnTransformer([
  ('label_encoder', LabelEncoder(), ['ever_married', 'work_type']),
  ('one_hot_encoder', OneHotEncoder(), ['gender', 'residence_type'])
])

pipeline = ColumnTransformer([
  ('num', num_pipeline, num_columns),
  ('cat', cat_pipeline, cat_columns)
])

However, when I call fit_transform on the pipeline to preprocess the input feature matrix, I get a TypeError:

X_prep = pipeline.fit_transform(X)
TypeError: fit_transform() takes 2 positional arguments but 3 were given

Upvotes: 4

Views: 4339

Answers (2)

r4bc1

Reputation: 1

I had a similar problem, and the point was... I just had to give my transformers different names... that's it.

preprocessor = ColumnTransformer(
transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat_ordinal', categorical_transformer_OE, ordinal_cols),
    ('cat', categorical_transformer_OH, OH_cols)
])

I thought that I couldn't change names like "num" and "cat". Such a silly mistake, LOL.

(Maybe somebody makes a similar mistake and this helps :))
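The point above can be reproduced in a minimal sketch (the column and transformer names here are placeholders, not from the original code): ColumnTransformer validates transformer names when it is fitted, so duplicates raise a ValueError, while unique names work fine.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({'ord_col': ['low', 'high'], 'oh_col': ['u', 'v']})

# Duplicate transformer names ('cat' twice) raise a ValueError at fit time
bad = ColumnTransformer([
    ('cat', OrdinalEncoder(), ['ord_col']),
    ('cat', OneHotEncoder(), ['oh_col']),
])
try:
    bad.fit_transform(df)
except ValueError as err:
    print(err)  # complains that the provided names are not unique

# With unique names the same setup works
good = ColumnTransformer([
    ('cat_ordinal', OrdinalEncoder(), ['ord_col']),
    ('cat_onehot', OneHotEncoder(), ['oh_col']),
])
print(good.fit_transform(df).shape)  # 1 ordinal + 2 one-hot columns
```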

Upvotes: 0

KRKirov

Reputation: 4004

Your error comes from the use of LabelEncoder in your pipeline. The documentation states that it should only be used to encode the target (y) variable, so its fit_transform accepts a single argument and fails when ColumnTransformer passes it both X and y. Use OrdinalEncoder instead if your variables really are ordinal; otherwise use one-hot encoding. The code below also uses a simpler pipeline.

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Set-up
df = pd.DataFrame({'gender': np.random.choice(['M', 'F'], size=5),
                   'ever_married': np.random.choice(['Y', 'N'], size=5),
                   'residence_type': list('ABCDE'),
                   'work_type': list('abcde'),
                   'num_col': np.array([1, 2, np.nan, 3, 4])})

ord_cols = ['ever_married', 'work_type']
ohe_cols = ['gender', 'residence_type']
num_cols = ['num_col']

# Preprocessing pipeline
num_pipeline = Pipeline([
  ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
  ('scaler', StandardScaler())
])

pipeline = ColumnTransformer(
    [
     ('num_imputer', num_pipeline, num_cols),
     ('ord_encoder', OrdinalEncoder(), ord_cols),
     ('ohe_encoder', OneHotEncoder(), ohe_cols)
     ]
    )

# Preprocessing
X_prep = pipeline.fit_transform(df)

Output:

df

  gender ever_married residence_type work_type  num_col
0      M            Y              A         a      1.0
1      F            Y              B         b      2.0
2      F            Y              C         c      NaN
3      M            Y              D         d      3.0
4      M            N              E         e      4.0

X_prep

array([[-1.5,  1. ,  0. ,  0. ,  1. ,  1. ,  0. ,  0. ,  0. ,  0. ],
       [-0.5,  1. ,  1. ,  1. ,  0. ,  0. ,  1. ,  0. ,  0. ,  0. ],
       [ 0. ,  1. ,  2. ,  1. ,  0. ,  0. ,  0. ,  1. ,  0. ,  0. ],
       [ 0.5,  1. ,  3. ,  0. ,  1. ,  0. ,  0. ,  0. ,  1. ,  0. ],
       [ 1.5,  0. ,  4. ,  0. ,  1. ,  0. ,  0. ,  0. ,  0. ,  1. ]])

Upvotes: 2
