How to one-hot encode a dataframe where each row has lists

Question

I am trying to feed in data that has lists within lists of data to a machine learning algorithm:

for example a patient may have several medications and several responses to the medication they may also have names. So if they take more than 1 medication it will come as a list of 2 or more. They only have one name.

I believe one-hot encoding is the correct way to do so.

Here is what I have done so far:

I have a dataframe:

df = pandas.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, {'drug': ['drugC','drugD'], 'patient': 'angel'}])

             drug patient
0  [drugA, drugB]    john
1  [drugC, drugD]   angel

I want to get something like:

  drugA  drugB drugC drugD patient
0  1       1     0     0     john
0  0       0     1     1     angel

I tried this:

pandas.get_dummies(df.apply(pandas.Series).stack()).sum(level=0)

But got:

TypeError: unhashable type: 'list'

andrew_reece · Accepted Answer

Drawing heavily on this answer, here's one approach:

df = pd.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, 
                   {'drug': ['drugC','drugD'], 'patient': 'angel'}])
s = df.drug
      .apply(lambda x: pd.Series(x))
      .unstack()
df2 = df.join(pd.DataFrame(s.reset_index(level=0, drop=True)))
        .drop('drug',1)
        .rename(columns={0:'drug'})
df2.merge(pd.get_dummies(df2.drug), left_index=True, right_index=True)
   .drop('drug',1)

Output:

  patient  drugA  drugB  drugC  drugD
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0

How to one-hot encode a dataframe where each row has lists

Answers (2)

Related Questions