Zolzaya Luvsandorj

Reputation: 635

How to leave numerical columns out when using sklearn OneHotEncoder?

Environment:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

Sample data:

X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'], 
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1,0,1]})

Desired outcome: I would like to include sklearn OneHotEncoder in my pipeline in this format:

encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)

# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)

Challenge: OneHotEncoder is encoding everything, including the numerical columns. I want to keep the numerical columns as they are and encode only the categorical features, in an efficient way that's compatible with Pipeline().

encoder = OneHotEncoder(drop='first', sparse=False) 
encoder.fit(X_train)
encoder.transform(X_train) # Column C gets one-hot encoded too - this is what I want to avoid

Workaround (not ideal): I can get around the problem using pd.get_dummies(). However, that means I can't include it in my pipeline. Or is there a way?

X_train = pd.get_dummies(X_train, drop_first=True)
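For example, wrapping it in a FunctionTransformer does run, but it seems fragile: get_dummies() learns nothing during fit(), so categories unseen at training time would produce mismatched dummy columns at predict time:

from sklearn.preprocessing import FunctionTransformer

# Possible but fragile: get_dummies is stateless, so the columns it
# produces depend entirely on whatever data reaches transform()
encoder = FunctionTransformer(lambda df: pd.get_dummies(df, drop_first=True))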

Upvotes: 5

Views: 3273

Answers (2)

MaximeKan

Reputation: 4221

My preferred solution for this would be to use sklearn's ColumnTransformer (see here).

It lets you split the data into as many groups as you want (in your case, categorical vs. numerical data) and apply different preprocessing operations to each group. The transformer can then be used in a pipeline like any other sklearn preprocessing tool. Here is a short example:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({"a":[1,2,3],"b":["A","A","B"]})
y = np.array([0,1,1])

OHE = OneHotEncoder()
scaler = StandardScaler()
RFC = RandomForestClassifier()

cat_cols = ["b"]
num_cols = ["a"]

transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
                                 ('num_cols', scaler, num_cols)])

pipe = Pipeline([("preprocessing", transformer),
                 ("classifier", RFC)])
pipe.fit(X, y)

NB: I have taken some license with your request, because this applies the scaler only to the numerical data, which I believe makes more sense. If you do want to apply the scaler to all columns, you can do that as well by modifying this example.
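For instance, here is a rough sketch of that modification (the names transformer_all and pipe_all are just illustrative): one-hot encode the categorical columns, pass the rest through, and put a single StandardScaler after the ColumnTransformer so it sees every column, dummies included:

OHE_dense = OneHotEncoder(sparse=False)  # dense output so StandardScaler can center it
                                         # (note: 'sparse' was renamed 'sparse_output' in newer sklearn)
transformer_all = ColumnTransformer([('cat_cols', OHE_dense, cat_cols)],
                                    remainder='passthrough')  # numeric columns pass through unchanged

pipe_all = Pipeline([("preprocessing", transformer_all),
                     ("scaler", StandardScaler()),  # now scales all columns, dummies included
                     ("classifier", RandomForestClassifier())])
pipe_all.fit(X, y)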

Upvotes: 2

Seleme

Reputation: 251

What I would do is create my own custom transformer and put it into a pipeline. That way, you have a lot of power over the data in your hands. The steps are as follows:

1) Create a custom transformer class inheriting from BaseEstimator and TransformerMixin. In its transform() function, try to detect whether each column's values are numerical or categorical. If you do not want to deal with that logic right now, you can always pass the categorical column names to your transformer so it can select them on the fly (a sketch follows this list).

2) (Optional) Create your custom transformer to handle columns with only categorical values.

3) (Optional) Create your custom transformer to handle columns with only numerical values.

4) Build two pipelines (one for the categorical columns, the other for the numerical ones) using the transformers you created; you can also mix in existing ones from sklearn.

5) Merge two pipelines with FeatureUnion.

6) Merge your big pipeline with your ML model.

7) Call fit_transform()
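A minimal sketch of steps 1) through 6), under some assumptions (the ColumnSelector class and the hard-coded column names are illustrative, not taken from the notebook below; X_train and y_train are the question's sample data):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Step 1: custom transformer that keeps only the given columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]

# Step 4: one pipeline per column group
cat_pipe = Pipeline([('select', ColumnSelector(['A', 'B'])),
                     ('encode', OneHotEncoder())])
num_pipe = Pipeline([('select', ColumnSelector(['C'])),
                     ('scale', StandardScaler())])

# Step 5: merge the two pipelines with FeatureUnion
preprocessing = FeatureUnion([('cat', cat_pipe), ('num', num_pipe)])

# Steps 6 and 7: attach the model and fit
pipe = Pipeline([('preprocessing', preprocessing),
                 ('classifier', RandomForestClassifier(random_state=0))])
pipe.fit(X_train, y_train['Y'])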

The sample code (no optionals implemented): GitHub Jupyter Notebook

Upvotes: 1
