Reputation: 2426
I am using OneHotEncoder to encode a few categorical variables (e.g. Sex and AgeGroup). The resulting feature names from the encoder look like 'x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', etc.
>>> import pandas as pd
>>> train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> train_X_encoded = encoder.fit_transform(train_X[['Sex', 'AgeGroup']])
>>> encoder.get_feature_names()
array(['x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', 'x1_30.0', 'x1_45.0',
       'x1_60.0', 'x1_75.0'], dtype=object)
Is there a way to tell OneHotEncoder to prefix each feature name with the original column name, giving something like Sex_female, AgeGroup_15.0, etc., similar to what pandas get_dummies() does?
Upvotes: 63
Views: 52234
Reputation: 1106
A list with the original column names can be passed to get_feature_names.
>>> encoder.get_feature_names(['Sex', 'AgeGroup'])
array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
dtype=object)
get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
>>> encoder.get_feature_names_out(['Sex', 'AgeGroup'])
array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
dtype=object)
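If the same code has to run on both older and newer scikit-learn releases, one option is to pick whichever method the installed version provides. The following is a minimal sketch, not part of the library: `encoded_feature_names` is a hypothetical helper name, and the example reuses the DataFrame from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_X = pd.DataFrame({'Sex': ['male', 'female'] * 3,
                        'AgeGroup': [0, 15, 30, 45, 60, 75]})
cols = ['Sex', 'AgeGroup']

encoder = OneHotEncoder()
encoder.fit(train_X[cols])


def encoded_feature_names(enc, input_features):
    """Hypothetical helper: return prefixed feature names on old and new scikit-learn."""
    if hasattr(enc, 'get_feature_names_out'):
        # scikit-learn >= 1.0
        return enc.get_feature_names_out(input_features)
    # Older releases only have the deprecated method.
    return enc.get_feature_names(input_features)


names = list(encoded_feature_names(encoder, cols))
print(names)
```

On scikit-learn 1.0 and later, an encoder fitted on a DataFrame also remembers the input column names, so calling get_feature_names_out() with no argument should give the same prefixed names.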
Upvotes: 81
Reputation: 1614
The encoder returns a sparse matrix: type(train_X_encoded) → scipy.sparse.csr.csr_matrix
Use pandas.DataFrame.sparse.from_spmatrix to load a sparse matrix; otherwise, convert it to a dense matrix and load it with pandas.DataFrame.
# pandas.DataFrame.sparse.from_spmatrix will load this sparse matrix
>>> print(train_X_encoded)
(0, 1) 1.0
(0, 2) 1.0
(1, 0) 1.0
(1, 3) 1.0
(2, 1) 1.0
(2, 4) 1.0
(3, 0) 1.0
(3, 5) 1.0
(4, 1) 1.0
(4, 6) 1.0
(5, 0) 1.0
(5, 7) 1.0
# pandas.DataFrame will load this dense matrix
>>> print(train_X_encoded.todense())
[[0. 1. 1. 0. 0. 0. 0. 0.]
[1. 0. 0. 1. 0. 0. 0. 0.]
[0. 1. 0. 0. 1. 0. 0. 0.]
[1. 0. 0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0. 0. 1.]]
import pandas as pd
column_name = encoder.get_feature_names_out(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame.sparse.from_spmatrix(train_X_encoded, columns=column_name)
# display(one_hot_encoded_frame)
   Sex_female  Sex_male  AgeGroup_0  AgeGroup_15  AgeGroup_30  AgeGroup_45  AgeGroup_60  AgeGroup_75
0         0.0       1.0         1.0          0.0          0.0          0.0          0.0          0.0
1         1.0       0.0         0.0          1.0          0.0          0.0          0.0          0.0
2         0.0       1.0         0.0          0.0          1.0          0.0          0.0          0.0
3         1.0       0.0         0.0          0.0          0.0          1.0          0.0          0.0
4         0.0       1.0         0.0          0.0          0.0          0.0          1.0          0.0
5         1.0       0.0         0.0          0.0          0.0          0.0          0.0          1.0
As of scikit-learn v1.0, use get_feature_names_out instead of get_feature_names.
Upvotes: 24
Reputation: 41
Thanks for a nice solution, @Nursnaaz. The sparse matrix needs to be converted into a dense matrix.
column_name = encoder.get_feature_names(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame(train_X_encoded.todense(), columns=column_name)
As of scikit-learn v1.0, use get_feature_names_out instead of get_feature_names.
Upvotes: 4