Reputation: 47
I am working with a medical database, and I have a dataframe that looks like this:
INC_KEY COMORBID1 COMORBID2 COMORBID3 COMORBID4
0 160389417 Hypertension None None None
1 160789043 COPD Hypertension Diabetes None
2 160039662 Hypertension ADLC Other None
3 160367584 Diabetes None None None
4 160008818 None None None None
As you can see, there are multiple columns for comorbidities that are labeled numerically, each of which can have several different values.
I need to make it such that the column names are the comorbidities, and the values are either a 0 for no and a 1 for yes.
Example:
INC_KEY HYPERTENSION COPD ADLC DIABETES
0 160389417 1 0 0 0
1 160789043 1 1 0 1
2 160039662 1 0 1 0
3 160367584 0 0 0 1
4 160008818 0 0 0 0
I've given a simplified version, but there are 24 different possible comorbidities with which I need to do this.
I have tried pd.get_dummies(), however it does not work the way I need it to. The get_dummies function creates individual columns for each unique value for EACH COMORBID1-COMORBID24. So instead of 24 new columns, I end up with 24*24=576 new columns.
So with get_dummies the new column names would be:
COMORBID1_HYPERTENSION, COMORBID1_COPD, COMORBID1_ADLC, COMORBID1_DIABETES, COMORBID2_HYPERTENSION, COMOBID2_COPD, COMORBID2_ADLC, COMORBID2_DIABETES...
and so on all the way through 24.
What is the best way to do what I am trying to do?
Thank you in advance to anyone who helps :)
Upvotes: 0
Views: 601
Reputation: 863611
First convert INC_KEY
to index and remove only NaN
s columns by DataFrame.dropna
, then use get_dummies
with both parameters and last aggregate max
for only 0,1
values:
df = df.set_index('INC_KEY').dropna(axis=1, how='all')
df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
print (df)
ADLC COPD Diabetes Hypertension Other
INC_KEY
160389417 0 0 0 1 0
160789043 0 1 1 1 0
160039662 1 0 0 1 1
160367584 0 0 1 0 0
160008818 0 0 0 0 0
Upvotes: 2