solopiu

Reputation: 756

python category encoders on multiple columns

I need to test several category encoders on different columns that contain the same values. The columns share the same values, but a value never appears twice in the same row. For example, I could have:

dft = pd.DataFrame({
'col0':["a", "b", "a", "c", "b", "d"], 
'col1':["c", "d", "b", "d", "c", "c"],
'col2':["b", "a", "c", "b", "a", "a"],
})

  col0 col1 col2
0    a    c    b
1    b    d    a
2    a    b    c
3    c    d    b
4    b    c    a
5    d    c    a

A first row such as "a", "c", "c" could not occur.

To encode the columns I'm using the Python library category_encoders. The problem is that I need to fit the encoder on one column and then apply the encoding to multiple columns. For example, given a df like this:

dft = pd.DataFrame({
'col0':["a", "b", "a", "c", "b", "d"], 
'col1':["c", "d", "b", "d", "c", "c"]})

  col0 col1
0    a    c
1    b    d
2    a    b
3    c    d
4    b    c
5    d    c

What I'd like to have is:

  col0 col1  a  b  c  d
0    a    c  1  0  1  0
1    b    d  0  1  0  1
2    a    b  1  1  0  0
3    c    d  0  0  1  1
4    b    c  0  1  1  0
5    d    c  0  0  1  1

But with the category_encoders library I have to fit on some column(s) and apply the transform to those same column(s). Using it on a single column, this happens:

import pandas as pd
import category_encoders as ce

dft = pd.DataFrame({
'col0':["a", "b", "a", "c", "b", "d"], 
'col1':["c", "d", "b", "d", "c", "c"],
})
# one-hot encoding is used only as an example to visualize the problem
encoder = ce.OneHotEncoder(cols=None, use_cat_names=True)
encoder.fit(dft['col0'])

encoder.transform(dft['col0'])

Output:

   col0_a  col0_b  col0_c  col0_d
0       1       0       0       0
1       0       1       0       0
2       1       0       0       0
3       0       0       1       0
4       0       1       0       0
5       0       0       0       1

Then apply the transformation to the other column:

encoder.transform(dft['col1']) 

Output:

KeyError: 'col0'

If the fit is done on both columns (since col0 and col1 contain the same unique values), the output is:

encoder.fit(dft[['col0','col1']])
encoder.transform(dft[['col0','col1']])

   col0_a  col0_b  col0_c  col0_d  col1_c  col1_d  col1_b
0       1       0       0       0       1       0       0
1       0       1       0       0       0       1       0
2       1       0       0       0       0       0       1
3       0       0       1       0       0       1       0
4       0       1       0       0       1       0       0
5       0       0       0       1       1       0       0

The example above is just one way of encoding my columns; my goal is to try different methods. Are there other libraries that can do this kind of encoding without restricting transform to the fitted columns (so that I don't have to write every category encoding method from scratch)?
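
To make the "fit on one column, apply to several" requirement concrete, here is a minimal pandas-only sketch. It is not category_encoders: it assumes a shared CategoricalDtype can stand in for the fitted state, and the names shared, codes and onehot_col1 are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dft = pd.DataFrame({
    "col0": ["a", "b", "a", "c", "b", "d"],
    "col1": ["c", "d", "b", "d", "c", "c"],
})

# "Fit": learn the category set from col0 only.
shared = CategoricalDtype(categories=sorted(dft["col0"].unique()))

# "Transform": reuse that single mapping for every column (ordinal codes).
# A value that was not seen in col0 would get the code -1.
codes = dft.apply(lambda s: s.astype(shared).cat.codes)

# One-hot for col1 with the column layout learned from col0;
# 'a' never occurs in col1 but still gets an all-zero column.
onehot_col1 = pd.get_dummies(dft["col1"].astype(shared), dtype=int)
```

Because both columns are cast to the same dtype, the encoded layout is identical no matter which column it is applied to.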

Upvotes: 2

Views: 1006

Answers (2)

Anvar Kurmukov

Reputation: 662

I would actually prefer encoding every column with a separate encoder, and I believe the behavior you've described is intentional. Otherwise you could have columns car color and phone color both equal to red, resulting in the same feature red=True regardless of whether it was the car or the phone. But if you really want to achieve this, you could do some simple post-processing like this:

# df is assumed to hold the encoder output, e.g. encoder.transform(dft[['col0', 'col1']])
categories = ['a', 'b', 'c', 'd']
columns = ['col0_a', 'col0_b', 'col0_c', 'col0_d', 'col1_c', 'col1_d', 'col1_b']

for category in categories:
    # collect every encoded column that belongs to this category
    sum_columns = [col for col in columns if col.endswith(f'_{category}')]
    # 1 if the category appears in any of the source columns, else 0
    df[category] = df[sum_columns].sum(axis=1).astype(bool).astype(int)

df = df.drop(columns=columns)
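
A self-contained variant of this post-processing could look like the sketch below. It assumes pd.get_dummies as a stand-in for the encoder output (it produces the same col0_a style names) and merges columns by the suffix after the last underscore, so it would not work if category names themselves contain underscores. The names encoded and merged are illustrative.

```python
import pandas as pd

dft = pd.DataFrame({
    "col0": ["a", "b", "a", "c", "b", "d"],
    "col1": ["c", "d", "b", "d", "c", "c"],
})

# Stand-in for the one-hot encoder output: col0_a, col0_b, ..., col1_d.
encoded = pd.get_dummies(dft, dtype=int)

# Group the encoded columns by their category suffix and keep the max,
# i.e. 1 if the category occurs in any of the source columns.
merged = (
    encoded.T
    .groupby(lambda name: name.rsplit("_", 1)[-1])
    .max()
    .T
)

out = dft.join(merged)
```

Grouping on the suffix avoids listing the categories and column names by hand.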

Upvotes: 1

Shubham Sharma

Reputation: 71689

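You can stack the dataframe to reshape it, use str.get_dummies to create indicator variables for the stacked values, and finally sum over index level 0 so that each original row is collapsed back into a single row:

# groupby(level=0) groups the stacked entries by their original row label
enc = dft.stack().str.get_dummies().groupby(level=0).sum()
out = dft.join(enc)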

>>> out

  col0 col1  a  b  c  d
0    a    c  1  0  1  0
1    b    d  0  1  0  1
2    a    b  1  1  0  0
3    c    d  0  0  1  1
4    b    c  0  1  1  0
5    d    c  0  0  1  1
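
The same stack-then-aggregate idea carries over to the three-column frame from the question. The sketch below is a variation rather than the exact code above: it assumes pd.get_dummies on the stacked values and aggregates with max instead of sum, so that a value repeated within a row would still give 1 rather than 2.

```python
import pandas as pd

dft = pd.DataFrame({
    "col0": ["a", "b", "a", "c", "b", "d"],
    "col1": ["c", "d", "b", "d", "c", "c"],
    "col2": ["b", "a", "c", "b", "a", "a"],
})

# Stack to one long Series indexed by (row, column), one-hot the values,
# then collapse back to one row per original index label.
enc = pd.get_dummies(dft.stack(), dtype=int).groupby(level=0).max()
out = dft.join(enc)
```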

Upvotes: 1
