Zolzaya Luvsandorj

Reputation: 635

How to leave numerical columns out when using sklearn OneHotEncoder?

Environment:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

Sample data:

X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'], 
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1,0,1]})

Desired outcome: I would like to include sklearn OneHotEncoder in my pipeline in this format:

encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)

# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)

Challenge: OneHotEncoder is encoding everything, including the numerical columns. I want to keep the numerical columns as they are and encode only the categorical features, in an efficient way that's compatible with Pipeline().

encoder = OneHotEncoder(drop='first', sparse=False) 
encoder.fit(X_train)
encoder.transform(X_train) # Column C gets one-hot encoded too - this is what I want to avoid

Workaround (not ideal): I can get around the problem using pd.get_dummies(). However, that means I can't include it in my pipeline. Or is there a way?

X_train = pd.get_dummies(X_train, drop_first=True)
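For example, wrapping it in a FunctionTransformer does run, but it seems fragile: get_dummies() learns nothing during fit(), so categories unseen at training time would produce mismatched dummy columns at predict time:

from sklearn.preprocessing import FunctionTransformer

# Possible but fragile: get_dummies is stateless, so the columns it
# produces depend entirely on whatever data reaches transform()
encoder = FunctionTransformer(lambda df: pd.get_dummies(df, drop_first=True))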

Upvotes: 5

Views: 3273

Answers (2)

MaximeKan

Reputation: 4221

My preferred solution for this would be to use sklearn's ColumnTransformer (see here).

It lets you split the data into as many groups as you want (in your case, categorical vs. numerical data) and apply different preprocessing operations to each group. The transformer can then be used in a pipeline like any other sklearn preprocessing tool. Here is a short example:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({"a":[1,2,3],"b":["A","A","B"]})
y = np.array([0,1,1])

OHE = OneHotEncoder()
scaler = StandardScaler()
RFC = RandomForestClassifier()

cat_cols = ["b"]
num_cols = ["a"]

transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
                                 ('num_cols', scaler, num_cols)])

pipe = Pipeline([("preprocessing", transformer),
                 ("classifier", RFC)])
pipe.fit(X, y)

NB: I have taken some license with your request, because this applies the scaler only to the numerical data, which I believe makes more sense. If you do want to apply the scaler to all columns, you can do that as well by modifying this example.
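For instance, here is a rough sketch of that modification (the names transformer_all and pipe_all are just illustrative): one-hot encode the categorical columns, pass the rest through, and put a single StandardScaler after the ColumnTransformer so it sees every column, dummies included:

OHE_dense = OneHotEncoder(sparse=False)  # dense output so StandardScaler can center it
                                         # (note: 'sparse' was renamed 'sparse_output' in newer sklearn)
transformer_all = ColumnTransformer([('cat_cols', OHE_dense, cat_cols)],
                                    remainder='passthrough')  # numeric columns pass through unchanged

pipe_all = Pipeline([("preprocessing", transformer_all),
                     ("scaler", StandardScaler()),  # now scales all columns, dummies included
                     ("classifier", RandomForestClassifier())])
pipe_all.fit(X, y)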

Upvotes: 2

Seleme

Reputation: 251

What I would do is create my own custom transformer and put it into a pipeline. That way, you have a lot of power over the data in your hands. The steps are as follows:

1) Create a custom transformer class inheriting from BaseEstimator and TransformerMixin. In its transform() function, try to detect whether each column's values are numerical or categorical. If you do not want to deal with that logic right now, you can always pass the categorical column names to your transformer so it can select them on the fly (a sketch follows this list).

2) (Optional) Create your custom transformer to handle columns with only categorical values.

3) (Optional) Create your custom transformer to handle columns with only numerical values.

4) Build two pipelines (one for the categorical columns, the other for the numerical ones) using the transformers you created; you can also mix in existing ones from sklearn.

5) Merge two pipelines with FeatureUnion.

6) Merge your big pipeline with your ML model.

7) Call fit_transform()
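A minimal sketch of steps 1) through 6), under some assumptions (the ColumnSelector class and the hard-coded column names are illustrative, not taken from the notebook below; X_train and y_train are the question's sample data):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Step 1: custom transformer that keeps only the given columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]

# Step 4: one pipeline per column group
cat_pipe = Pipeline([('select', ColumnSelector(['A', 'B'])),
                     ('encode', OneHotEncoder())])
num_pipe = Pipeline([('select', ColumnSelector(['C'])),
                     ('scale', StandardScaler())])

# Step 5: merge the two pipelines with FeatureUnion
preprocessing = FeatureUnion([('cat', cat_pipe), ('num', num_pipe)])

# Steps 6 and 7: attach the model and fit
pipe = Pipeline([('preprocessing', preprocessing),
                 ('classifier', RandomForestClassifier(random_state=0))])
pipe.fit(X_train, y_train['Y'])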

The sample code (no optionals implemented): GitHub Jupyter Notebook

Upvotes: 1
