Reputation: 93
Imagine I have 4 categories in 3 columns, but these categories are repeated across the columns. For example:
df1 = pd.DataFrame(data=[['a', 'b', 'c'], ['b', 'a', 'd'], ['a', 'c', 'd'], ['b', 'd', 'a']])
0 1 2
0 a b c
1 b a d
2 a c d
3 b d a
When I transform it, I get 9 columns (one per distinct value in each column), when I should be getting only 4, one for each category (a, b, c and d).
ohe = ColumnTransformer([('ohe', OneHotEncoder(categories='auto', sparse=False), [0, 1, 2])], remainder='passthrough')
df2 = ohe.fit_transform(df1)
Here df2 holds those nine columns, but I want to obtain only four: one for each of the 'a', 'b', 'c' and 'd' categories spread across my columns.
Is there any way to obtain this output?
Out[17]:
a b c d
0 1 1 1 0
1 1 1 0 1
2 1 0 1 1
3 1 1 0 1
Upvotes: 0
Views: 103
Reputation: 120399
Update
I want to obtain only four... one for each 'a', 'b', 'c' and 'd' categories distributed in my columns
You can use value_counts applied along the columns axis:
>>> df1.apply(pd.Series.value_counts, axis="columns").fillna(0).astype(int)
a b c d
0 1 1 1 0
1 1 1 0 1
2 1 0 1 1
3 1 1 0 1
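One caveat: value_counts returns counts, so a category that appears twice in the same row would give 2 rather than 1. Here is a minimal sketch of a strict 0/1 indicator variant, assuming the same df1 as in the question. Stacking the cells and taking the per-row maximum is just one way to do it, and the name indicators is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(data=[['a', 'b', 'c'], ['b', 'a', 'd'], ['a', 'c', 'd'], ['b', 'd', 'a']])

# Stack every cell into one long Series indexed by (row, column),
# one-hot encode that single Series, then collapse back to one line
# per original row by taking the maximum within each row.
indicators = (
    pd.get_dummies(df1.stack(), dtype=int)
    .groupby(level=0)
    .max()
)
print(indicators)
```

On this particular df1 no row repeats a category, so the result matches the value_counts output above.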
Old answer
Some explanation of how encoding works:
>>> df1
W X Y Z
0 a b c c
1 b a a b
2 a c a b
>>> df1.nunique()
W 2 # [a, b]
X 3 # [a, b, c]
Y 2 # [a, c]
Z 2 # [b, c]
For column W, there are two different values [a, b], so you need 2 columns to encode them. For instance:
a b
a 1 0
b 0 1
For column X, there are three different values [a, b, c], so you need 3 columns to encode them. For instance:
a b c
a 1 0 0
b 0 1 0
c 0 0 1
Note the identity matrix.
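As a quick check of the identity-matrix remark, here is a small sketch that encodes the three distinct values once each and compares the result with np.eye. The names values, encoded and is_identity are only illustrative.

```python
import numpy as np
import pandas as pd

# One occurrence of each distinct value, in sorted order.
values = pd.Series(['a', 'b', 'c'])

# One-hot encode: one output column per distinct value.
encoded = pd.get_dummies(values, dtype=int)

# Each value switches on exactly one column, so the result is the 3x3 identity.
is_identity = np.array_equal(encoded.to_numpy(), np.eye(3, dtype=int))
print(encoded)
print(is_identity)
```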
Let's use pd.get_dummies rather than OneHotEncoder to get a better understanding:
>>> pd.get_dummies(df1)
W_a W_b X_a X_b X_c Y_a Y_c Z_b Z_c
0 1 0 0 1 0 0 1 0 1
1 0 1 1 0 0 1 0 1 0
2 1 0 0 0 1 1 0 1 0
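To tie the two outputs together, the sketch below rebuilds the W/X/Y/Z frame shown above and checks that the number of dummy columns equals the sum of the distinct values per column. The names per_column and dummies are illustrative only.

```python
import pandas as pd

# Same example frame as in this answer.
df1 = pd.DataFrame({
    'W': ['a', 'b', 'a'],
    'X': ['b', 'a', 'c'],
    'Y': ['c', 'a', 'a'],
    'Z': ['c', 'b', 'b'],
})

# Distinct values per column: W=2, X=3, Y=2, Z=2.
per_column = df1.nunique()

# Each column is encoded independently, so the widths add up.
dummies = pd.get_dummies(df1, dtype=int)
print(per_column.sum(), dummies.shape[1])
```

Because each column is encoded on its own, a category shared between columns (such as a in W, X and Y) still gets a separate dummy column in each of them.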
The question is: why do you want to get only 3 columns?
Upvotes: 2