chinna g

Reputation: 405

How to use sklearn Column Transformer?

I'm trying to convert a categorical value (in my case the country column) into an encoded value using LabelEncoder and then OneHotEncoder, and I was able to convert it. But I'm getting a warning that the OneHotEncoder 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?

Below is my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name

label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()

This is the output I'm getting. How can I get the same output with ColumnTransformer?

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      26        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(
    ( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

I was able to encode the country column with the above code, but the age and salary columns are missing from the x variable after transforming.

Upvotes: 26

Views: 50193

Answers (9)

Prayson W. Daniel

Reputation: 15588

It is a bit strange to one-hot encode continuous data such as Salary. It makes no sense unless you have binned the salary into certain ranges/categories. If I were you, I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can pipe it with a classifier, e.g.

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

Use it as so:

clf.fit(X_train, y_train)

This will apply the preprocessor and then pass the transformed data to the predictor.


If we want to select columns on the fly, we can modify our preprocessor to use a column selector based on the column dtypes:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        # only matches columns whose pandas dtype is "category"
        ('cat', categorical_transformer, selector(dtype_include="category"))])
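As a rough illustration of how dtype-based selection behaves, here is a small sketch on a few rows retyped from the question (the DataFrame and the variable names are my own, not the answer's). A string column is normally not `category` dtype after loading, so the sketch assumes you cast it explicitly before the `"category"` selector can pick it up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector

# A few rows from the question's data set (hypothetical variable name `df`).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
})

# Cast explicitly so that the "category" selector can find the column.
df["Country"] = df["Country"].astype("category")

# Calling a selector on a DataFrame returns the matching column names.
numeric_columns = selector(dtype_include=np.number)(df)
categorical_columns = selector(dtype_include="category")(df)

print(numeric_columns)      # ['Age', 'Salary']
print(categorical_columns)  # ['Country']
```

Without the cast, the categorical selector would return an empty list for this data and the country column would silently be dropped.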

Using GridSearch

from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)

Getting names of features

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False)  # added this line

# now (once clf is fitted) we can access feature names with

clf[:-1].get_feature_names_out()  # all steps before the estimator

Upvotes: 32


Reputation: 11

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])
# one-hot encode column 0 (country); sparse_threshold=0 always returns a dense array
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
#onehotencoder = OneHotEncoder(categorical_features = [0])
X = ct.fit_transform(X)

Upvotes: 0


Reputation: 246

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sparse_threshold=0 always returns a dense array, so no .toarray() is needed
onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = onehotencorder.fit_transform(x)

A great advantage of using OneHotEncoder inside a ColumnTransformer is that it can convert several columns at once. See this example, which passes several columns:

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1,3,5,6,7,8,9,13])],remainder='passthrough')

If it's a single column, you can do it the traditional way (note that this gives integer codes, not one-hot columns):

from sklearn.preprocessing import LabelEncoder
labelencoder_predictors = LabelEncoder()
x[:,0] = labelencoder_predictors.fit_transform(x[:,0])
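For comparison, here is a small sketch of what `LabelEncoder` on its own produces for the country values: a single column of integer codes rather than one indicator column per country. The list and variable names are illustrative only. The scikit-learn documentation describes `LabelEncoder` as intended for target labels rather than input features.

```python
from sklearn.preprocessing import LabelEncoder

# Country values taken from the first rows of the question's data set.
countries = ["France", "Spain", "Germany", "Spain", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# Classes are sorted alphabetically and numbered from 0.
print(list(encoder.classes_))  # ['France', 'Germany', 'Spain']
print(codes.tolist())          # [0, 2, 1, 2, 1, 0]
```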

Another suggestion:

Do not use variable names like x, y, z. Name each variable after what it represents, for example: predictors, classes, countries, etc.

Upvotes: 0

Suresh Mangs

Reputation: 725

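You can use OneHotEncoder directly; you don't need LabelEncoder.

# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",         # a name for this step
         OneHotEncoder(),  # the transformer to apply
         [0])              # country column or the column on which categorical operation to be performed
    ],
    remainder='passthrough'  # keep the remaining columns unchanged
)
X = transformer.fit_transform(X.tolist())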

Upvotes: 1


Reputation: 839

The simplest method is to use pandas get_dummies on your CSV DataFrame:

dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset,columns=['Country'])

Finished. The Country column is replaced by one indicator column per country (Country_France, Country_Germany, Country_Spain).
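A minimal sketch of the same idea without a CSV file, using three rows retyped from the question (the in-memory DataFrame stands in for `yourfile.csv`). Passing `dtype=int` is my addition so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Stand-in for pd.read_csv("yourfile.csv"): three rows from the question.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

# Replace Country with one 0/1 indicator column per distinct value.
encoded = pd.get_dummies(dataset, columns=["Country"], dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Salary', 'Country_France', 'Country_Germany', 'Country_Spain']
print(encoded.iloc[0].tolist())
# [44, 72000, 1, 0, 0]
```

Note that the indicator columns are appended after Age and Salary, and that `get_dummies` does not remember the categories it saw, so new data containing a different set of countries will produce different columns.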

Upvotes: 5


Reputation: 1

Since you are transforming only the country column (i.e., [0] in your example), use remainder="passthrough" to keep the remaining columns as they are.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# sparse_threshold=0 makes fit_transform return a dense array
preprocess = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
x = np.array(preprocess.fit_transform(x))

Upvotes: 0

Arvind Chavhan

Reputation: 488

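from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]),
    remainder="passthrough"  # keep the other columns unchanged
)
X = preprocess.fit_transform(X)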

I fixed the same issue using the above code.

Upvotes: 2


Reputation: 11

@Fawwaz Yusran To tackle this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

Remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Since you are using OneHotEncoder directly you don't need LabelEncoder.
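A short sketch of that point: `OneHotEncoder` accepts string categories directly, so no integer conversion is needed first. The single-column DataFrame below is an illustrative stand-in for the country column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# String categories, passed as a 2-D input (one column).
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0].tolist())  # ['France', 'Germany', 'Spain']
print(one_hot.tolist())
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```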

Upvotes: 1


Reputation: 131

I think the poster is not trying to transform Age and Salary. According to the documentation, ColumnTransformer (and make_column_transformer) transforms only the columns specified in the transformer (i.e., [0] in your example) and drops the rest by default. You should set remainder="passthrough" to keep the rest of the columns. In other words:

preprocessor = make_column_transformer( (OneHotEncoder(),[0]),remainder="passthrough")
x = preprocessor.fit_transform(x)
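Here is a self-contained sketch of this approach on the question's ten rows, assuming the data is held in a pandas DataFrame named `x`. The `sparse_threshold=0` argument is my addition to guarantee a dense NumPy array; the rest follows the snippet above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# The input data set from the question.
x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),    # one-hot encode the first column (Country)
    remainder="passthrough",   # keep Age and Salary unchanged
    sparse_threshold=0,        # always return a dense array
)
encoded = preprocessor.fit_transform(x)

print(encoded.shape)        # (10, 5)
print(encoded[0].tolist())  # [1.0, 0.0, 0.0, 44.0, 72000.0]
```

The columns come out as France, Germany, Spain, Age, Salary, which matches the layout the question asks for.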

Upvotes: 13
