wsilva916
wsilva916

Reputation: 47

Pandas one-hot encoding with multiple like columns

I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.

In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).

Here is a sample of the original data:

| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219  | E119  | I48   | 1      |
| I500  |       |       | 0      |
| I48   | I500  | F171  | 1      |
| I219  | E119  | I500  | 0      |
| I219  | I48   |       | 0      |

Here's what I tried:

import pandas as pd
df = pd.read_csv('micro.csv', dtype='object')
df['cond1'] = pd.Categorical(df['cond1'])
df['cond2'] = pd.Categorical(df['cond2'])
df['cond3'] = pd.Categorical(df['cond3'])

dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix = 'cond')
dummies

Which gives me:

| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1         | 0        | 0         | 1         | 0        | 0         | 0         | 1        | 0         |
| 0         | 0        | 1         | 0         | 0        | 0         | 0         | 0        | 0         |
| 0         | 1        | 0         | 0         | 0        | 1         | 1         | 0        | 0         |
| 1         | 0        | 0         | 1         | 0        | 0         | 0         | 0        | 1         |
| 1         | 0        | 0         | 0         | 1        | 0         | 0         | 0        | 0         |

So I have multiple coded columns for any code that appears in more than one column (I48 and I500).. I would like only a single column for each so I can check for correlations between individual codes and my target variable.

Is there a way to do this? This is the result I'm after:

| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1         | 1        | 0         | 1         | 0         |
| 0         | 0        | 1         | 0         | 0         |
| 0         | 1        | 1         | 0         | 1         |
| 1         | 0        | 1         | 1         | 0         |
| 1         | 1        | 0         | 0         | 0         |

Upvotes: 2

Views: 508

Answers (2)

G.G
G.G

Reputation: 765

pd.get_dummies(df1.iloc[:,:-1]).rename(columns=lambda x:x.split('_')[1]).T.groupby(level=0).sum().T


| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1         | 1        | 0         | 1         | 0         |
| 0         | 0        | 1         | 0         | 0         |
| 0         | 1        | 1         | 0         | 1         |
| 1         | 0        | 1         | 1         | 0         |
| 1         | 1        | 0         | 0         | 0         |

Upvotes: 0

jezrael
jezrael

Reputation: 862691

Get max values if need 1 and 0 data in output:

dfDummies = dummies.max(axis=1, level=0)

Or use sum if need count 1 values:

dfDummies = dummies.sum(axis=1, level=0)

Upvotes: 2

Related Questions