Supraja
Supraja

Reputation: 23

Split labels of dataframe with multiple categorical values in python for encoding labels

I have a column like this in a dataset.

print (pharma_data['Treated_with_drugs'].astype('category').cat.categories)
Index(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
       'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
       'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
       'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
       'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
       'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
       'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
       'DX6'],
      dtype='object')

I want to split that column into 6 columns as: DX1, DX2, DX3, DX4, DX5, DX6 with values as 0 or 1.

For example, if row value is 'DX1 DX2 DX5 ' then,

column names: DX1, DX2, DX3, DX4, DX5, DX6

column values: 1 1 0 0 1 0

How can I do that?

Upvotes: 2

Views: 46

Answers (2)

jezrael
jezrael

Reputation: 862771

Use Series.str.strip with Series.str.get_dummies:

a = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
       'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
       'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
       'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
       'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
       'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
       'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
       'DX6']

pharma_data = pd.DataFrame({'Treated_with_drugs':a})

df = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')
print (df)
    DX1  DX2  DX3  DX4  DX5  DX6
0     1    0    0    0    0    0
1     1    1    0    0    0    0
2     1    1    1    0    0    0
3     1    1    1    1    0    0
4     1    1    1    1    1    0
5     1    1    1    0    1    0
6     1    1    0    1    0    0
7     1    1    0    1    1    0
8     1    1    0    0    1    0
9     1    0    1    0    0    0
10    1    0    1    1    0    0
11    1    0    1    1    1    0
12    1    0    1    0    1    0
13    1    0    0    1    0    0
14    1    0    0    1    1    0
15    1    0    0    0    1    0
16    0    1    0    0    0    0
17    0    1    1    0    0    0
18    0    1    1    1    0    0
19    0    1    1    1    1    0
20    0    1    1    0    1    0
21    0    1    0    1    0    0
22    0    1    0    1    1    0
23    0    1    0    0    1    0
24    0    0    1    0    0    0
25    0    0    1    1    0    0
26    0    0    1    1    1    0
27    0    0    1    0    1    0
28    0    0    0    1    0    0
29    0    0    0    1    1    0
30    0    0    0    0    1    0
31    0    0    0    0    0    1

Upvotes: 2

borisdonchev
borisdonchev

Reputation: 1224

Let

rows = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ', 'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ', 'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ', 'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ', 'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ', 'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ', 'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ', 'DX6']

Then what you want to achieve can be done like so:

columns = [row.split(' ') for row in rows]
unique_columns = set([item for sublist in columns for item in sublist if item != '' ])

df = pd.DataFrame(columns=unique_columns)

for i, row in enumerate(rows): 
    row_dict = {col: 0 for col in unique_columns} 
    for element in row.split(' '): 
        if element != '':
            row_dict[element] = 1
    df.loc[i] = row_dict
    
>>> 
   DX3 DX1 DX4 DX2 DX5 DX6
0    0   1   0   0   0   0
1    0   1   0   1   0   0
2    1   1   0   1   0   0
3    1   1   1   1   0   0
4    1   1   1   1   1   0
5    1   1   0   1   1   0
6    0   1   1   1   0   0
7    0   1   1   1   1   0
8    0   1   0   1   1   0
9    1   1   0   0   0   0
10   1   1   1   0   0   0
11   1   1   1   0   1   0
12   1   1   0   0   1   0
13   0   1   1   0   0   0
14   0   1   1   0   1   0
15   0   1   0   0   1   0
16   0   0   0   1   0   0
17   1   0   0   1   0   0
18   1   0   1   1   0   0
19   1   0   1   1   1   0
20   1   0   0   1   1   0
21   0   0   1   1   0   0
22   0   0   1   1   1   0
23   0   0   0   1   1   0
24   1   0   0   0   0   0
25   1   0   1   0   0   0
26   1   0   1   0   1   0
27   1   0   0   0   1   0
28   0   0   1   0   0   0
29   0   0   1   0   1   0
30   0   0   0   0   1   0
31   0   0   0   0   0   1

Upvotes: 1

Related Questions