Optimize splitting column of lists into separate columns

Question

I have been looking into splitting a column made up of lists into separate columns. I have a solution, but it is very slow.

I have the following pandas DataFrame:

|basket                             |
|['two apple','A banana']           |
|['Red pear','A banana']            |
|['two apple','A banana','Red pear']|

I would like to transform it into the following DataFrame:

|basket                             |two apple|A banana|Red pear|
|['two apple','A banana']           |1        |1       |0       |
|['Red pear','A banana']            |0        |1       |1       |
|['two apple','A banana','Red pear']|1        |1       |1       |

I have the following code, after already creating the columns I needed:

for index, row in enumerate(df.basket):
    # progress report every 10,000 rows
    if index > 0 and index % 10000 == 0:
        print(100 * index / len(df.basket), ' percent complete')
    for col in df.columns:
        for pattern in row:
            if col == pattern:
                # set a single cell; assumes the default 0..n-1 index
                df.at[index, col] = 1
                break

With the number of rows I have, this is taking forever. I was hoping to find a more efficient way of populating the columns, even if that means creating them from the column of lists.

jezrael · Accepted Answer

Use MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per distinct item
df = df.join(pd.DataFrame(mlb.fit_transform(df['basket']),
                          columns=mlb.classes_,
                          index=df.index))
print(df)
                            basket  A banana  Red pear  two apple
0            [two apple, A banana]         1         0          1
1             [Red pear, A banana]         1         1          0
2  [two apple, A banana, Red pear]         1         1          1
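
If scikit-learn is not available, a pandas-only route may also work. The sketch below is not the method from the answer above; it swaps in Series.explode and pd.get_dummies, and it rebuilds the three sample rows from the question so that it runs on its own. The names items, indicators and result are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to be real Python lists,
# not strings that merely look like lists).
df = pd.DataFrame({
    "basket": [
        ["two apple", "A banana"],
        ["Red pear", "A banana"],
        ["two apple", "A banana", "Red pear"],
    ]
})

# One row per (original row, item); explode repeats the original index label.
items = df["basket"].explode()

# One 0/1 column per distinct item, then collapse back to one row per
# original index label. max() keeps the value at 1 if an item repeats.
indicators = pd.get_dummies(items, dtype=int).groupby(level=0).max()

result = df.join(indicators)
print(result)
```

As with MultiLabelBinarizer, the indicator columns come out in sorted order rather than in the order shown in the question. Whether this is faster than the scikit-learn version on a large frame is something to measure rather than assume.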
