Reputation: 103
I have below pandas dataset:
set_id
A,B
A,C,E
A
Desired result:
set_id set_id_1 set_id_2 set_id_3
A,B A B null
A,C,E A C E
A A null null
set_id can have n number of values. Suppose the maximum number of values in set_id is 100, I should have 100 new columns
I tried using multilabel binarizer
df1 = pd.DataFrame()
df1['set_id'] = df['set_id'].str.split(',')
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1=df.join(pd.DataFrame(mlb.fit_transform(df['set_id']) ,columns=mlb.classes_,index=df.head(100).index))
It will create more than 100K columns as I have more than 100K unique records
Upvotes: 1
Views: 112
Reputation: 862511
Use str.split
with expand=True
for DataFrame
:
df1 = df['set_id'].str.split(',', expand=True)
Alternative faster solution with list comprehension:
df1 = pd.DataFrame([x.split(',') for x in df['set_id']])
df1.columns = [f'set_id_{x+1}' for x in df1.columns]
df1 = df.join(df1)
print (df1)
set_id set_id_1 set_id_2 set_id_3
0 A,B A B None
1 A,C,E A C E
2 A A None None
Upvotes: 1