Elias Urra

Reputation: 93

Using OneHotEncoder on multiple columns with categories repeated among columns?

Imagine I have 4 categories spread over 3 columns, and the same categories are repeated among the columns. For example:

import pandas as pd

df1 = pd.DataFrame(data=[['a', 'b', 'c'], ['b', 'a', 'd'], ['a', 'c', 'd'], ['b', 'd', 'a']])
   0  1  2
0  a  b  c
1  b  a  d
2  a  c  d
3  b  d  a

When I transform, I get 8 columns, when I should be getting only 4, one for each category (a, b, c and d).

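from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe = ColumnTransformer([('ohe', OneHotEncoder(categories='auto', sparse_output=False), [0, 1, 2])], remainder='passthrough')

df2 = ohe.fit_transform(df1)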

df2 then contains the eight encoded columns, but I want only four: one for each of the categories 'a', 'b', 'c' and 'd' distributed across my columns.

Is there any way to obtain this output?

Out[17]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1
2  1  0  1  1
3  1  1  0  1

Upvotes: 0

Views: 103

Answers (1)

Corralien

Reputation: 120399

Update

I want to obtain only four... one for each 'a', 'b', 'c' and 'd' categories distributed in my columns

You can apply value_counts along the columns axis, so that the values are counted row by row:

>>> df1.apply(pd.Series.value_counts, axis="columns").fillna(0).astype(int)
   a  b  c  d
0  1  1  1  0
1  1  1  0  1
2  1  0  1  1
3  1  1  0  1
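
If you would rather stay within scikit-learn, one possible alternative (not part of the approach above, so treat it as a sketch) is MultiLabelBinarizer, which treats each row as a set of labels. It assumes you only need presence/absence: a category that appears twice in one row still yields 1, whereas value_counts would yield 2. The name `encoded` is only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df1 = pd.DataFrame(data=[['a', 'b', 'c'], ['b', 'a', 'd'], ['a', 'c', 'd'], ['b', 'd', 'a']])

# Each row is passed as a list of labels; the result has one column
# per distinct label found anywhere in the frame (sorted: a, b, c, d).
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df1.to_numpy().tolist()),
    columns=mlb.classes_,
    index=df1.index,
)
print(encoded)
```

Because the fitted `mlb` object remembers its classes, it can be reused with `mlb.transform(...)` on new rows, which the value_counts one-liner cannot do directly.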

Old answer

Some explanation of how the encoding works:

>>> df1
   W  X  Y  Z
0  a  b  c  c
1  b  a  a  b
2  a  c  a  b

>>> df1.nunique()
W    2  # [a, b]
X    3  # [a, b, c]
Y    2  # [a, c]
Z    2  # [b, c]

For column W, there are two different values [a, b] so you need 2 columns to encode them. For instance:

   a  b
a  1  0
b  0  1

For column X, there are three different values [a, b, c] so you need 3 columns to encode them. For instance:

   a  b  c
a  1  0  0
b  0  1  0
c  0  0  1

Note the identity matrix.
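
The identity-matrix view can be sketched with NumPy: give each category an integer code and pick the matching row of the identity matrix. This is only an illustration of the idea for column X, not how OneHotEncoder is implemented internally; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

x = pd.Series(["b", "a", "c"], name="X")  # column X from the example above

# sort=True assigns codes in sorted category order: a=0, b=1, c=2
codes, categories = pd.factorize(x, sort=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the matrix with the codes encodes the whole column.
one_hot = np.eye(len(categories), dtype=int)[codes]

encoded = pd.DataFrame(one_hot, columns=categories, index=x.index)
print(encoded)
```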

Let's use pd.get_dummies rather than OneHotEncoder for a better understanding:

>>> pd.get_dummies(df1)
   W_a  W_b  X_a  X_b  X_c  Y_a  Y_c  Z_b  Z_c
0    1    0    0    1    0    0    1    0    1
1    0    1    1    0    0    1    0    1    0
2    1    0    0    0    1    1    0    1    0

The question is: why do you want to get only 3 columns?

Upvotes: 2
