Reputation: 149
I have a pandas dataframe that consists of standard numerical columns and additional column (char) that contains a list of values.
Instead of having these encoded as a list, I would like to create columns for each possible value across all the lists and encode whether the list contains each of the possible values as a boolean column for each unique value.
Input - Dataframe
char id
1 [a, b, c] 2
2 [a, c] 3
5 [c, a, d] 6
7 [] 8
Desired output - each potential list value gets a column, its presence in the char list will mark it true. It's important that empty lists equal false for all the created columns.
id char a char b char c char d
1 2 True True True False
2 3 True False True False
5 6 True False True True
7 8 False False False False
What is an efficient solution to solve this that scales for any amount of potential values that can be contained in the list (char in this instance)?
Upvotes: 0
Views: 803
Reputation: 323226
Let us check explode
after pandas 0.25
s=df.char.explode().str.get_dummies().sum(level=0).astype(bool)
df=df.join(s)
Or we can do the sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['char']),columns=mlb.classes_, index=df.index).astype(bool)
df=df.join(s)
Upvotes: 2