Reputation: 622
I am unable to use remainder='passthrough' when I use StandardScaler and OneHotEncoder at the same time. Whichever way I arrange it, I run into a problem: either "positional argument follows keyword argument", or a problem with fit_transform... you name it. Here is what I am doing:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough')
trans_cols.fit_transform(X)
here are my columns:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'],
dtype='object')
The code above works; I am just not able to combine the 2 estimators when using the remainder keyword argument. Here is what I am trying:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough',
    (StandardScaler(), ['age', 'job', 'marital', 'education', 'default',
                        'balance', 'housing', 'loan', 'contact', 'month',
                        'duration', 'campaign', 'pdays', 'previous',
                        'poutcome']))
However, the above does not work unless I remove remainder and keep the 2 tuples, which is understandable. However, doing that, it tries to encode some of my numeric columns, and I get a message telling me that it encountered columns containing floats. Plus, my accuracy drops severely.
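(For reference, the underlying rule here is Python's, not scikit-learn's: positional arguments may not follow keyword arguments, so remainder has to come after all the transformer tuples. A minimal sketch with a hypothetical function f standing in for make_column_transformer:)

```python
# Minimal sketch of the syntax rule behind the error:
# in Python, positional arguments may not follow keyword arguments.
def f(*transformers, remainder='drop'):
    return transformers, remainder

out = f(('a', [1]), ('b', [2]), remainder='passthrough')  # OK: keyword last
print(out)

# SyntaxError if uncommented -- positional argument follows keyword argument:
# f(('a', [1]), remainder='passthrough', ('b', [2]))
```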
Upvotes: 3
Views: 8001
Reputation: 9
# Basic transformer
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

encoder = ColumnTransformer(
    transformers=[('one_hot', OneHotEncoder(), ['Name1', 'Char1'])],
    remainder='passthrough')
encoded_data = encoder.fit_transform(others_data)
Ensure that the output from the OneHotEncoder is as expected. OneHotEncoder may produce a sparse matrix by default, which needs to be converted to a dense format before it can be used in a DataFrame. If your encoded data is in a sparse matrix format, convert it to dense with the .toarray() method when creating the DataFrame:
encoded_data_dense = encoded_data.toarray()
encoded_data_df = pd.DataFrame(encoded_data_dense,
                               columns=encoder.get_feature_names_out())
Upvotes: -1
Reputation: 89
Making some additions to KRKirov's answer, as they might be useful as well.
Because make_column_transformer takes the transformer tuples as positional arguments, Python requires all of them to come before any keyword argument. The remainder keyword (which handles every feature left untouched by the listed transformers) therefore has to come at the end.
The problem with your code is that you passed remainder in the middle, between two positional tuples, which is a syntax error. So first list all the special processing, then handle the remaining features with the remainder parameter.
I will explain it with code below.
(1) importing
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
onhe = OneHotEncoder()
scaler = StandardScaler()
(2) I will create a basic DataFrame:
df = pd.DataFrame({'sex': ['m', 'f', 'f', 'm'],
                   'age': [45, 25, 10, 31],
                   'married': ['y', 'y', 'n', 'y'],
                   'salary': [1000, 300, 370, 500],
                   'child': [5, 1, 0, 3]})
print(df)
(3) Let's say we want to encode the sex and married columns, apply standard scaling to the age and salary columns, and leave the child column as it is.
transforming = make_column_transformer((onhe,['sex','married']),
(scaler,['age', 'salary']),
remainder = 'passthrough')
processed_df = transforming.fit_transform(df)
print(processed_df)
Note that remainder is assigned at the end of the call. What is more, if you want to apply the scaler to all remaining features ('age', 'salary', 'child'), you can use:
transforming_1 = make_column_transformer((onhe, ['sex', 'married']), remainder = scaler)
processed_df_1 = transforming_1.fit_transform(df)
print(processed_df_1)
It will encode the two given columns and then apply StandardScaler to all remaining columns.
As for your situation, the code that gave you the error should look like this (note that the scaler is applied only to the numeric columns; StandardScaler cannot handle string columns such as 'job' or 'marital'):
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    (StandardScaler(), ['age', 'balance', 'duration', 'campaign', 'pdays',
                        'previous']),
    remainder='passthrough')
Upvotes: 2
Reputation: 4004
The preferred practice is not to use StandardScaler on one-hot-encoded columns. The first example below demonstrates applying OHE to the categorical variables and StandardScaler to the numeric columns. The second example shows the sequential application of OHE on selected columns followed by StandardScaler on all columns, but this is not recommended.
Example_1:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
df = pd.DataFrame({'Cat_Var': np.random.choice(['a', 'b'], size=5),
                   'Num_Var': np.arange(5)})
cat_cols = ['Cat_Var']
num_cols = ['Num_Var']
col_transformer = make_column_transformer(
(OneHotEncoder(), cat_cols),
remainder=StandardScaler())
X = col_transformer.fit_transform(df)
Output:
df
Out[57]:
Cat_Var Num_Var
0 b 0
1 a 1
2 b 2
3 a 3
4 a 4
X
Out[58]:
array([[ 0. , 1. , -1.41421356],
[ 1. , 0. , -0.70710678],
[ 0. , 1. , 0. ],
[ 1. , 0. , 0.70710678],
[ 1. , 0. , 1.41421356]])
Example 2:
col_transformer_2 = ColumnTransformer(
[('cat_transform', OneHotEncoder(), cat_cols)],
remainder='passthrough'
)
pipe = Pipeline(
[
('col_tranform', col_transformer_2),
('standard_scaler', StandardScaler())
])
X_2 = pipe.fit_transform(df)
Output:
X_2
Out[62]:
array([[-1.22474487, 1.22474487, -1.41421356],
[ 0.81649658, -0.81649658, -0.70710678],
[-1.22474487, 1.22474487, 0. ],
[ 0.81649658, -0.81649658, 0.70710678],
[ 0.81649658, -0.81649658, 1.41421356]])
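As a variation not covered in this answer, scikit-learn's make_column_selector (available since version 0.22) can pick columns by dtype instead of listing them by name, which is convenient when there are many columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector

df = pd.DataFrame({'Cat_Var': ['b', 'a', 'b', 'a', 'a'],
                   'Num_Var': np.arange(5)})

# Select columns by dtype instead of listing them by name:
# object columns get one-hot encoded, numeric columns get scaled
col_transformer = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include=np.number)))
X = col_transformer.fit_transform(df)
print(X.shape)  # two one-hot columns plus one scaled numeric column
```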
Upvotes: 5