Sean Roudnitsky
Sean Roudnitsky

Reputation: 47

How to convert columns with multiple values into multiple columns with binary values?

I am working with a medical database, and I have a dataframe that looks like this:

      INC_KEY   COMORBID1    COMORBID2    COMORBID3  COMORBID4
0   160389417   Hypertension None         None       None
1   160789043   COPD         Hypertension Diabetes   None
2   160039662   Hypertension ADLC         Other      None
3   160367584   Diabetes     None         None       None
4   160008818   None         None         None       None

As you can see, there are multiple columns for comorbidities that are labeled numerically, each of which can have several different values.

I need to make it such that the column names are the comorbidities, and the values are either a 0 for no and a 1 for yes.

Example:

      INC_KEY  HYPERTENSION COPD ADLC DIABETES
 0  160389417             1    0    0        0
 1  160789043             1    1    0        1
 2  160039662             1    0    1        0
 3  160367584             0    0    0        1
 4  160008818             0    0    0        0

I've given a simplified version, but there are 24 different possible comorbidities with which I need to do this.

I have tried pd.get_dummies(), however it does not work the way I need it to. The get_dummies function creates individual columns for each unique value for EACH COMORBID1-COMORBID24. So instead of 24 new columns, I end up with 24*24=576 new columns.

So with get_dummies the new column names would be:

COMORBID1_HYPERTENSION, COMORBID1_COPD, COMORBID1_ADLC, COMORBID1_DIABETES, COMORBID2_HYPERTENSION, COMOBID2_COPD, COMORBID2_ADLC, COMORBID2_DIABETES...

and so on all the way through 24.

What is the best way to do what I am trying to do?

Thank you in advance to anyone who helps :)

Upvotes: 0

Views: 601

Answers (1)

jezrael
jezrael

Reputation: 863611

First convert INC_KEY to index and remove only NaNs columns by DataFrame.dropna, then use get_dummies with both parameters and last aggregate max for only 0,1 values:

df = df.set_index('INC_KEY').dropna(axis=1, how='all')
df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
print (df)
           ADLC  COPD  Diabetes  Hypertension  Other
INC_KEY                                             
160389417     0     0         0             1      0
160789043     0     1         1             1      0
160039662     1     0         0             1      1
160367584     0     0         1             0      0
160008818     0     0         0             0      0

Upvotes: 2

Related Questions