Reputation: 622
I am unable to use remainder='passthrough' when I use StandardScaler and OneHotEncoder at the same time. Whichever way I arrange it, I run into a problem: either "positional argument follows keyword argument", or a problem with fit_transform... you name it. Here is what I am doing:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough')
trans_cols.fit_transform(X)
here are my columns:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'],
dtype='object')
The code above works; I am just not able to combine the 2 estimators when using the remainder keyword argument. Here is what I am trying:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough',
    (StandardScaler(), ['age', 'job', 'marital', 'education', 'default',
                        'balance', 'housing', 'loan', 'contact', 'month',
                        'duration', 'campaign', 'pdays', 'previous',
                        'poutcome']))
However, the above does not work unless I remove remainder and keep the 2 tuples, which is understandable. However, doing that, it tries to encode some of my numeric columns, and I get a message telling me that it encountered columns containing floats. Plus, my accuracy drops severely.
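(For reference, the underlying rule here is Python's, not scikit-learn's: positional arguments may not follow keyword arguments, so remainder has to come after all the transformer tuples. A minimal sketch with a hypothetical function f standing in for make_column_transformer:)

```python
# Minimal sketch of the syntax rule behind the error:
# in Python, positional arguments may not follow keyword arguments.
def f(*transformers, remainder='drop'):
    return transformers, remainder

out = f(('a', [1]), ('b', [2]), remainder='passthrough')  # OK: keyword last
print(out)

# SyntaxError if uncommented -- positional argument follows keyword argument:
# f(('a', [1]), remainder='passthrough', ('b', [2]))
```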
Upvotes: 3
Views: 8001
Reputation: 9
# Basic transformer
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

encoder = ColumnTransformer(
    transformers=[('one_hot', OneHotEncoder(), ['Name1', 'Char1'])],
    remainder='passthrough')
encoded_data = encoder.fit_transform(others_data)
Ensure that the output from the OneHotEncoder is as expected. OneHotEncoder may produce a sparse matrix by default, which needs to be converted to a dense format before it can be used in a DataFrame. If your encoded data is in a sparse matrix format, convert it to dense with the .toarray() method when creating the DataFrame:
encoded_data_dense = encoded_data.toarray()
encoded_data_df = pd.DataFrame(encoded_data_dense,
                               columns=encoder.get_feature_names_out())
Upvotes: -1
Reputation: 89
Making some additions to KRKirov's answer, as they might be useful as well.
Because make_column_transformer takes the transformer tuples as positional arguments, Python requires all of them to come before any keyword argument. The remainder keyword (which handles every feature left untouched by the listed transformers) therefore has to come at the end.
The problem with your code is that you passed remainder in the middle, between two positional tuples, which is a syntax error. So first list all the special processing, then handle the remaining features with the remainder parameter.
I will explain it with code below.
(1) importing
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
onhe = OneHotEncoder()
scaler = StandardScaler()
(2) I will create a basic DataFrame:
df = pd.DataFrame({'sex': ['m', 'f', 'f', 'm'],
                   'age': [45, 25, 10, 31],
                   'married': ['y', 'y', 'n', 'y'],
                   'salary': [1000, 300, 370, 500],
                   'child': [5, 1, 0, 3]})
print(df)
(3) Let's say we want to encode the sex and married columns, apply standard scaling to the age and salary columns, and leave the child column as it is.
transforming = make_column_transformer((onhe,['sex','married']),
(scaler,['age', 'salary']),
remainder = 'passthrough')
processed_df = transforming.fit_transform(df)
print(processed_df)
Note that remainder is assigned at the end of the call. What is more, if you want to apply the scaler to all remaining features ('age', 'salary', 'child'), you can use:
transforming_1 = make_column_transformer((onhe, ['sex', 'married']), remainder = scaler)
processed_df_1 = transforming_1.fit_transform(df)
print(processed_df_1)
It will encode the two given columns and then apply StandardScaler to all remaining columns.
As for your situation, the code that gave you the error should look like this (note that the scaler is applied only to the numeric columns; StandardScaler cannot handle string columns such as 'job' or 'marital'):
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    (StandardScaler(), ['age', 'balance', 'duration', 'campaign', 'pdays',
                        'previous']),
    remainder='passthrough')
Upvotes: 2
Reputation: 4004
The preferred practice is not to use StandardScaler on one-hot-encoded columns. The first example below demonstrates applying OHE to the categorical variables and StandardScaler to the numeric columns. The second example shows the sequential application of OHE on selected columns followed by StandardScaler on all columns, but this is not recommended.
Example_1:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
df = pd.DataFrame({'Cat_Var': np.random.choice(['a', 'b'], size=5),
                   'Num_Var': np.arange(5)})
cat_cols = ['Cat_Var']
num_cols = ['Num_Var']
col_transformer = make_column_transformer(
(OneHotEncoder(), cat_cols),
remainder=StandardScaler())
X = col_transformer.fit_transform(df)
Output:
df
Out[57]:
Cat_Var Num_Var
0 b 0
1 a 1
2 b 2
3 a 3
4 a 4
X
Out[58]:
array([[ 0. , 1. , -1.41421356],
[ 1. , 0. , -0.70710678],
[ 0. , 1. , 0. ],
[ 1. , 0. , 0.70710678],
[ 1. , 0. , 1.41421356]])
Example 2:
col_transformer_2 = ColumnTransformer(
[('cat_transform', OneHotEncoder(), cat_cols)],
remainder='passthrough'
)
pipe = Pipeline(
[
('col_tranform', col_transformer_2),
('standard_scaler', StandardScaler())
])
X_2 = pipe.fit_transform(df)
Output:
X_2
Out[62]:
array([[-1.22474487, 1.22474487, -1.41421356],
[ 0.81649658, -0.81649658, -0.70710678],
[-1.22474487, 1.22474487, 0. ],
[ 0.81649658, -0.81649658, 0.70710678],
[ 0.81649658, -0.81649658, 1.41421356]])
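As a variation not covered in this answer, scikit-learn's make_column_selector (available since version 0.22) can pick columns by dtype instead of listing them by name, which is convenient when there are many columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector

df = pd.DataFrame({'Cat_Var': ['b', 'a', 'b', 'a', 'a'],
                   'Num_Var': np.arange(5)})

# Select columns by dtype instead of listing them by name:
# object columns get one-hot encoded, numeric columns get scaled
col_transformer = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include=np.number)))
X = col_transformer.fit_transform(df)
print(X.shape)  # two one-hot columns plus one scaled numeric column
```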
Upvotes: 5