Reputation: 23
I have a column like this in a dataset.
print (pharma_data['Treated_with_drugs'].astype('category').cat.categories)
Index(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6'],
dtype='object')
I want to split that column into 6 columns as: DX1, DX2, DX3, DX4, DX5, DX6 with values as 0 or 1.
For example, if row value is 'DX1 DX2 DX5 ' then,
column names: DX1, DX2, DX3, DX4, DX5, DX6
column values: 1 1 0 0 1 0
How can I do that?
Upvotes: 2
Views: 46
Reputation: 862771
Use Series.str.strip
with Series.str.get_dummies
:
a = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6']
pharma_data = pd.DataFrame({'Treated_with_drugs':a})
df = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')
print (df)
DX1 DX2 DX3 DX4 DX5 DX6
0 1 0 0 0 0 0
1 1 1 0 0 0 0
2 1 1 1 0 0 0
3 1 1 1 1 0 0
4 1 1 1 1 1 0
5 1 1 1 0 1 0
6 1 1 0 1 0 0
7 1 1 0 1 1 0
8 1 1 0 0 1 0
9 1 0 1 0 0 0
10 1 0 1 1 0 0
11 1 0 1 1 1 0
12 1 0 1 0 1 0
13 1 0 0 1 0 0
14 1 0 0 1 1 0
15 1 0 0 0 1 0
16 0 1 0 0 0 0
17 0 1 1 0 0 0
18 0 1 1 1 0 0
19 0 1 1 1 1 0
20 0 1 1 0 1 0
21 0 1 0 1 0 0
22 0 1 0 1 1 0
23 0 1 0 0 1 0
24 0 0 1 0 0 0
25 0 0 1 1 0 0
26 0 0 1 1 1 0
27 0 0 1 0 1 0
28 0 0 0 1 0 0
29 0 0 0 1 1 0
30 0 0 0 0 1 0
31 0 0 0 0 0 1
Upvotes: 2
Reputation: 1224
Let
rows = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ', 'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ', 'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ', 'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ', 'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ', 'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ', 'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ', 'DX6']
Then what you want to achieve can be done like so:
columns = [row.split(' ') for row in rows]
unique_columns = set([item for sublist in columns for item in sublist if item != '' ])
df = pd.DataFrame(columns=unique_columns)
for i, row in enumerate(rows):
row_dict = {col: 0 for col in unique_columns}
for element in row.split(' '):
if element != '':
row_dict[element] = 1
df.loc[i] = row_dict
>>>
DX3 DX1 DX4 DX2 DX5 DX6
0 0 1 0 0 0 0
1 0 1 0 1 0 0
2 1 1 0 1 0 0
3 1 1 1 1 0 0
4 1 1 1 1 1 0
5 1 1 0 1 1 0
6 0 1 1 1 0 0
7 0 1 1 1 1 0
8 0 1 0 1 1 0
9 1 1 0 0 0 0
10 1 1 1 0 0 0
11 1 1 1 0 1 0
12 1 1 0 0 1 0
13 0 1 1 0 0 0
14 0 1 1 0 1 0
15 0 1 0 0 1 0
16 0 0 0 1 0 0
17 1 0 0 1 0 0
18 1 0 1 1 0 0
19 1 0 1 1 1 0
20 1 0 0 1 1 0
21 0 0 1 1 0 0
22 0 0 1 1 1 0
23 0 0 0 1 1 0
24 1 0 0 0 0 0
25 1 0 1 0 0 0
26 1 0 1 0 1 0
27 1 0 0 0 1 0
28 0 0 1 0 0 0
29 0 0 1 0 1 0
30 0 0 0 0 1 0
31 0 0 0 0 0 1
Upvotes: 1