Reputation: 405
I'm trying to convert a categorical value (in my case the country column) into an encoded value using LabelEncoder and then OneHotEncoder, and I was able to convert the categorical values. But I'm getting a warning that OneHotEncoder's 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?
Below are my input data set and the code I tried.
Input Data set
Country Age Salary
France 44 72000
Spain 27 48000
Germany 30 54000
Spain 38 61000
Germany 40 67000
France 35 58000
Spain 26 52000
France 48 79000
Germany 50 83000
France 37 67000
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# x is my dataset variable (the table above)
label_encoder = LabelEncoder()
# LabelEncoder turns the country strings into integers 0..2
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])
hot_encoder = OneHotEncoder(categorical_features=[0])  # this line triggers the deprecation warning
x = hot_encoder.fit_transform(x).toarray()
And the output I'm getting is shown below. How can I get the same output with ColumnTransformer?
0 (France)  1 (Germany)  2 (Spain)  3 (Age)  4 (Salary)
1 0 0 44 72000
0 0 1 27 48000
0 1 0 30 54000
0 0 1 38 61000
0 1 0 40 67000
1 0 0 35 58000
0 0 1 26 52000
1 0 0 48 79000
0 1 0 50 83000
1 0 0 37 67000
I tried the following code:
from sklearn.compose import ColumnTransformer, make_column_transformer
preprocess = make_column_transformer(
    ([0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()
I was able to encode the country column with the above code, but the age and salary columns are missing from the x variable after transforming.
Upvotes: 26
Views: 50013
Reputation: 15558
It is a bit strange to encode continuous data such as Salary. It makes no sense unless you have binned the salary into certain ranges/categories. If I were you I would do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age', 'Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
From here you can pipe it into a classifier, e.g.:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
Use it like so:
clf.fit(X_train, y_train)
This will apply the preprocessor and then pass the transformed data to the predictor.
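For context, here is a minimal sketch of where X_train and y_train could come from; the df DataFrame and its 'Purchased' target column are assumptions for illustration, not part of the original question.
from sklearn.model_selection import train_test_split

# Assumption: `df` holds the question's table plus a hypothetical
# 'Purchased' target column (not present in the original data).
X = df[['Country', 'Age', 'Salary']]
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)          # preprocessor runs first, then the classifier
print(clf.score(X_test, y_test))   # the same preprocessing is re-applied to X_test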
If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:
from sklearn.compose import make_column_selector as selector
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
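Note that make_column_selector matches pandas dtypes, so the input must be a DataFrame, and a plain string column has to be cast to 'category' before the selector will pick it up. A small sketch, assuming df is the question's DataFrame:
import numpy as np
from sklearn.compose import make_column_selector as selector

df['Country'] = df['Country'].astype('category')  # object -> category dtype
print(selector(dtype_include=np.number)(df))      # ['Age', 'Salary']
print(selector(dtype_include='category')(df))     # ['Country']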
Using GridSearch
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
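Once the search has run, the standard GridSearchCV attributes report the winning combination:
print(grid_search.best_params_)  # e.g. which imputer strategy and C value won
print(grid_search.best_score_)   # mean cross-validated score of the best model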
Getting names of features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False,  # added this line
)

# now we can access feature names with
clf[:-1].get_feature_names_out()  # every step before the estimator
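With the names in hand you can, for example, inspect the preprocessed matrix as a labelled DataFrame (a sketch, assuming clf was fitted as above):
Xt = clf[:-1].transform(X_train)   # run every step except the final estimator
if hasattr(Xt, 'toarray'):         # densify in case the output is sparse
    Xt = Xt.toarray()
print(pd.DataFrame(Xt, columns=clf[:-1].get_feature_names_out()).head())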
Upvotes: 32
Reputation: 11
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])
# the country column is still column 0 after label encoding, so the index is [0], not [1]
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
# onehotencoder = OneHotEncoder(categorical_features=[0])  # old, deprecated API
X = ct.fit_transform(X).toarray()
Upvotes: 0
Reputation: 236
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

onehotencoder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0])], remainder='passthrough')
x = onehotencoder.fit_transform(x).toarray()
The great advantage of OneHotEncoder is that it can convert several columns at once; see this example passing several columns:
onehotencoder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1, 3, 5, 6, 7, 8, 9, 13])], remainder='passthrough')
If it's a single column, you can do it the traditional way:
from sklearn.preprocessing import LabelEncoder
labelencoder_predictors = LabelEncoder()
x[:,0] = labelencoder_predictors.fit_transform(x[:,0])
Another suggestion: do not use variable names like x, y, z; name them after what they represent, e.g. predictors, classes, countries, etc.
Upvotes: 0
Reputation: 725
You can use OneHotEncoder directly and don't need to use LabelEncoder:
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ("OneHotEncoder",
         OneHotEncoder(),
         [0]  # the country column, or whichever column needs encoding
         )
    ],
    remainder='passthrough'
)
X = transformer.fit_transform(X.tolist())
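For reference, a self-contained sketch of this transformer on the question's first three rows; OneHotEncoder sorts categories alphabetically, so the output columns are France, Germany, Spain, followed by the passthrough Age and Salary:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['France', 44, 72000],
              ['Spain', 27, 48000],
              ['Germany', 30, 54000]], dtype=object)
transformer = ColumnTransformer(
    transformers=[("OneHotEncoder", OneHotEncoder(), [0])],
    remainder='passthrough')
print(transformer.fit_transform(X))  # first row: [1.0 0.0 0.0 44 72000]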
Upvotes: 1
Reputation: 839
The simplest method is to use pandas get_dummies on your CSV DataFrame:
dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset,columns=['Country'])
Done. The Country column is replaced by one indicator column per country (Country_France, Country_Germany, Country_Spain).
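A quick sketch of the result (recent pandas versions emit boolean dummies; pass dtype=int for 0/1 columns):
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                   'Age': [44, 27, 30],
                   'Salary': [72000, 48000, 54000]})
print(pd.get_dummies(df, columns=['Country'], dtype=int))
#    Age  Salary  Country_France  Country_Germany  Country_Spain
# 0   44   72000               1                0              0
# 1   27   48000               0                0              1
# 2   30   54000               0                1              0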
Upvotes: 5
Reputation: 1
Since you are transforming only the country column (i.e., [0] in your example), use remainder="passthrough" to keep the remaining columns as they are. Try:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
preprocess = ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                                              [0])], remainder="passthrough")
x = np.array(preprocess.fit_transform(x), dtype=int)  # np.int is deprecated; use int
Upvotes: 0
Reputation: 488
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]),
    remainder="passthrough")
X = preprocess.fit_transform(X)
I fixed the same issue using the above code.
Upvotes: 2
Reputation: 11
@Fawwaz Yusran To tackle this warning...
FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)
Remove the following...
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
Since you are using OneHotEncoder directly, you don't need LabelEncoder.
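In other words, the whole thing collapses to a sketch like this (assuming column 0 is the country column, as in the question):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Feed the raw string column straight to OneHotEncoder;
# no LabelEncoder pass is needed first.
ct = ColumnTransformer([('onehot', OneHotEncoder(categories='auto'), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)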
Upvotes: 1
Reputation: 131
I think the poster is not trying to transform Age and Salary. Per the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) transforms only the columns specified in the transformers (i.e., [0] in your example) and drops the rest by default. You should set remainder="passthrough" to keep the remaining columns. In other words:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer((OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)
Upvotes: 13