Reputation: 485
I have been looking into splitting a column made up of lists into separate columns. I have a solution, but it is very slow.
I have the following pandas dataframe:
|basket                             |
|['two apple','A banana']           |
|['Red pear','A banana']            |
|['two apple','A banana','Red pear']|
I would like to transform it into the following dataframe:
|basket                             |two apple|A banana|Red pear|
|['two apple','A banana']           |1        |1       |0       |
|['Red pear','A banana']            |0        |1       |1       |
|['two apple','A banana','Red pear']|1        |1       |1       |
I have the following code, after already creating the columns I needed:
for index, row in enumerate(df.basket):
    # Report progress every 10,000 rows
    if index > 0 and index % 10000 == 0:
        print(100 * index / len(df.basket), ' percent complete')
    for col in df.columns:
        for pattern in row:
            if col == pattern:
                # Set the indicator for this row and column
                df.loc[df.index[index], col] = 1
                break
With the number of rows I have, this is taking forever. I was hoping to find a more efficient way of populating the columns, even if I have to create them from the column of lists.
Upvotes: 2
Views: 112
Reputation: 862841
Use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# One 0/1 column per distinct item, aligned to the original index
df = df.join(pd.DataFrame(mlb.fit_transform(df['basket']),
                          columns=mlb.classes_,
                          index=df.index))
print(df)
                            basket  A banana  Red pear  two apple
0            [two apple, A banana]         1         0          1
1             [Red pear, A banana]         1         1          0
2  [two apple, A banana, Red pear]         1         1          1
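To make it concrete what this kind of binarization computes, here is a minimal plain-Python sketch of the same idea. It is not how scikit-learn implements it; the function name `binarize` and the variable names are hypothetical and chosen only for illustration. It collects the distinct items in sorted order, then fills one 0/1 row per basket using a dictionary lookup instead of comparing every item against every column.

```python
def binarize(baskets):
    """Return (labels, rows): sorted distinct items and one 0/1 row per basket.

    Illustrative sketch only; `binarize` is a hypothetical helper,
    not part of pandas or scikit-learn.
    """
    # Distinct items across all baskets, in sorted order
    labels = sorted({item for basket in baskets for item in basket})
    # Map each item to its column position for constant-time lookup
    position = {label: i for i, label in enumerate(labels)}
    rows = []
    for basket in baskets:
        row = [0] * len(labels)
        for item in basket:
            row[position[item]] = 1
        rows.append(row)
    return labels, rows


baskets = [
    ['two apple', 'A banana'],
    ['Red pear', 'A banana'],
    ['two apple', 'A banana', 'Red pear'],
]
labels, rows = binarize(baskets)
print(labels)
print(rows)
```

Each basket is visited once and each item costs a single dictionary lookup, which is why this approach avoids the nested scan over all columns for every row that makes the original loop slow.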
Upvotes: 4