Khaned
Khaned

Reputation: 443

pandas: convert a list in a column into separate columns

dataframe df has a column

id  data_words
1   [salt,major,lab,water]
2   [lab,plays,critical,salt]
3   [water,success,major]

I want to make one-hot-code of the column

id  critical        lab         major    plays     salt    success   water 
1   0               1           1        0         1       0         1
2   1               1           0        1         1       0         0
3   0               0           1        1         0       1         0

What I tried:

Attempt 1:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
                          columns=mlb.classes_,
                          index=df.index))

Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')

Attempt 2:

I converted the list into simple comma separated string with the following code

df['data_words_Joined'] = df.data_words.apply(','.join)

it makes the dataframe as following

id  data_words
1   salt,major,lab,water
2   lab,plays,critical,salt
3   water,success,major

Then I tried

pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)

But It makes all the words into one column name instead of separate words as separate columns

id  salt,major,lab,water    lab,plays,critical,salt    water,success,major
1   1                       0                          0
2   0                       1                          0  
3   0                       0                          1

Upvotes: 2

Views: 794

Answers (2)

Grayrigel
Grayrigel

Reputation: 3594

One possible approach could be to use get_dummies with your apply function:

new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)

Output:

    critical  lab  major  plays  salt  success  water
0         0    1      1      0     1        0      1
1         1    1      0      1     1        0      0
2         0    0      1      0     0        1      1

Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.

Upvotes: 2

Celius Stingher
Celius Stingher

Reputation: 18377

You can try with explode followed by pivot_table

df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))

Returning the following output:

data_words  critical  lab  major  plays  salt  success  water
id                                                           
1                  0    1      1      0     1        0      1
2                  1    1      0      1     1        0      0
3                  0    0      1      0     0        1      1

Edit: Adding data for replication purposes:

df = pd.DataFrame({'id':[1,2,3],
      'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})

Which looks like:

   id                    data_words
0   1     [salt, major, lab, water]
1   2  [lab, plays, critical, salt]
2   3       [water, success, major]

Upvotes: 3

Related Questions