Reputation: 756
I need to test several category encoders on different columns that draw from the same set of values. The values are shared across the columns, but a given value never appears twice in the same row. For example, I could have:
dft = pd.DataFrame({
'col0':["a", "b", "a", "c", "b", "d"],
'col1':["c", "d", "b", "d", "c", "c"],
'col2':["b", "a", "c", "b", "a", "a"],
})
col0 col1 col2
0 a c b
1 b d a
2 a b c
3 c d b
4 b c a
5 d c a
The first row could not be "a", "c", "c", for instance.
To encode the columns I'm using the Python library category encoders. The problem is that I need to fit the encoder on one column and then apply the encoding to multiple columns.
For example, given a df like this:
dft = pd.DataFrame({
'col0':["a", "b", "a", "c", "b", "d"],
'col1':["c", "d", "b", "d", "c", "c"]})
col0 col1
0 a c
1 b d
2 a b
3 c d
4 b c
5 d c
What I'd like to have is:
col0 col1 a b c d
0 a c 1 0 1 0
1 b d 0 1 0 1
2 a b 1 1 0 0
3 c d 0 0 1 1
4 b c 0 1 1 0
5 d c 0 0 1 1
But with the category encoders library I have to fit on one or more columns and apply the transform to those same columns.
Using category encoders on a single column, this happens:
import pandas as pd
import category_encoders as ce

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})
# One-hot encoding is used here only to make the problem easier to see
encoder = ce.OneHotEncoder(cols=None, use_cat_names=True)
encoder.fit(dft['col0'])
encoder.transform(dft['col0'])
Output:
col0_a col0_b col0_c col0_d
0 1 0 0 0
1 0 1 0 0
2 1 0 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 0 1
Then apply transformation to the other column:
encoder.transform(dft['col1'])
Output:
KeyError: 'col0'
If the fit is done on both columns (since col0 and col1 contain the same unique values), the output is:
encoder.fit(dft[['col0','col1']])
encoder.transform(dft[['col0','col1']])
col0_a col0_b col0_c col0_d col1_c col1_d col1_b
0 1 0 0 0 1 0 0
1 0 1 0 0 0 1 0
2 1 0 0 0 0 0 1
3 0 0 1 0 0 1 0
4 0 1 0 0 1 0 0
5 0 0 0 1 1 0 0
One-hot encoding is just one way to encode my columns; my goal is to try several different methods. Is there another library that can do this kind of encoding without restricting the transform to the fitted columns, so that I don't have to write every category encoding method from scratch?
Upvotes: 2
Views: 1006
Reputation: 662
I would actually prefer to encode every column with a separate encoder, and I believe the behavior you've described is intentional. With a shared encoder you could have the columns car color and phone color both being red, resulting in the same feature red=True regardless of whether it was the car or the phone. But if you really want to achieve this, you could do some simple post-processing like this:
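# df is assumed to be the one-hot frame returned by encoder.transform(dft[['col0', 'col1']])
categories = ['a', 'b', 'c', 'd']
columns = ['col0_a', 'col0_b', 'col0_c', 'col0_d', 'col1_c', 'col1_d', 'col1_b']
for category in categories:
    # collect the one-hot columns that belong to this category
    sum_columns = []
    for col in columns:
        if col.endswith(f'_{category}'):
            sum_columns.append(col)
    # 1 if the category appears in any of the source columns, else 0
    df[category] = df[sum_columns].sum(axis=1).astype(bool).astype(int)
df = df.drop(columns, axis=1)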
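To try the post-processing idea end to end without hard-coding the column list, here is a minimal sketch. It assumes `pd.get_dummies` as a stand-in for the encoder output (it produces the same `col0_a`-style names), and `collapse_by_category` is a hypothetical helper name for this example, not part of any library.

```python
import pandas as pd

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})

# Stand-in for the encoder output (assumption): one indicator column
# per (column, category) pair, named like 'col0_a'.
onehot = pd.get_dummies(dft, prefix_sep='_').astype(int)


def collapse_by_category(frame, sep='_'):
    """Merge indicator columns that share the same category suffix."""
    # Group column names by the text after the last separator.
    # This assumes the category values themselves do not contain `sep`.
    groups = {}
    for col in frame.columns:
        groups.setdefault(col.rsplit(sep, 1)[1], []).append(col)
    # A category is present in a row if any of its source columns flags it.
    return pd.DataFrame({
        category: frame[cols].any(axis=1).astype(int)
        for category, cols in sorted(groups.items())
    })


merged = collapse_by_category(onehot)
out = dft.join(merged)
print(out)
```

The grouping relies only on the column names, so it should also work on any encoder output that follows the same `<column>_<category>` naming.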
Upvotes: 1
Reputation: 71689
You can stack the dataframe to reshape it, then use str.get_dummies to create a dataframe of indicator variables for the stacked frame, and finally group by level=0 and take the sum:
enc = dft.stack().str.get_dummies().groupby(level=0).sum()
out = dft.join(enc)
>>> out
col0 col1 a b c d
0 a c 1 0 1 0
1 b d 0 1 0 1
2 a b 1 1 0 0
3 c d 0 0 1 1
4 b c 0 1 1 0
5 d c 0 0 1 1
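For encodings other than one-hot, the same "learn the categories once, apply them to every column" idea can be sketched in plain pandas. This is only an illustration, assuming an ordinal (integer code) encoding is wanted. The names `shared_dtype` and `codes` are made up for this example, and it does not use category encoders.

```python
import pandas as pd

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})

# "Fit": collect the categories once from all columns together.
categories = sorted(dft.stack().unique())
shared_dtype = pd.CategoricalDtype(categories=categories)

# "Transform": apply the same category -> code mapping to each column.
codes = dft.apply(lambda s: s.astype(shared_dtype).cat.codes)
print(codes)
```

Because both columns share one dtype, "c" maps to the same code whether it appears in col0 or col1.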
Upvotes: 1