Reputation: 191
I have a dataframe in Python that looks like the following:
Name Hobbies
0 Paul ["Watch_NBA", "Play_PS4"]
1 Jeff ["Play_hockey", "Read", "Play_PS4"]
2 Kyle ["Sleep", "Watch_NBA"]
I need to turn every element of the list into a new column and assign it a value of 1 if it appears in that row's original list and 0 otherwise. The result should be the following:
Name Watch_NBA Play_PS4 Play_hockey Read Sleep
0 Paul 1 1 0 0 0
1 Jeff 0 1 1 1 0
2 Kyle 1 0 0 0 1
Does anyone know how I could do this? Keep in mind that there will be a lot of hobbies in the column, so it should be automated rather than hardcoded. Thanks!
Upvotes: 1
Views: 365
Reputation: 113905
In [86]: df
Out[86]:
Name Hobbies
0 Paul [NBA, PS4]
1 Jeff [Hockey, Read, PS4]
2 Kyle [Sleep, NBA]
In [87]: df['dummy'] = 1
In [88]: df.explode("Hobbies").pivot(index='Name', columns='Hobbies', values='dummy').fillna(value=0)
Out[88]:
Hobbies Hockey NBA PS4 Read Sleep
Name
Jeff 1.0 0.0 1.0 1.0 0.0
Kyle 0.0 1.0 0.0 0.0 1.0
Paul 0.0 1.0 1.0 0.0 0.0
Upvotes: 1
Reputation: 12018
get_dummies() is good, but sklearn's MultiLabelBinarizer has better performance:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
a = mlb.fit_transform(df["Hobbies"])
df_expanded = pd.DataFrame(a, columns=mlb.classes_, index=df.index)
# merge them using the following:
df_merged = df.merge(df_expanded, left_index=True, right_index=True)
print(df_merged)
index Name Hobbies Play_PS4 Play_hockey Read Sleep Watch_NBA
0 Paul [Watch_NBA, Play_PS4] 1 0 0 0 1
1 Jeff [Play_hockey, Read, Play_PS4] 1 1 1 0 0
2 Kyle [Sleep, Watch_NBA] 0 0 0 1 1
Upvotes: 2
Reputation: 3653
You can try this:
n = df['Name']
# fillna(..., downcast=...) is deprecated in recent pandas, so cast to int explicitly
df = df['Hobbies'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0).astype(int)
df.insert(0, 'Name', n)
print(df)
Output:
Name Watch_NBA Play_PS4 Play_hockey Read Sleep
0 Paul 1 1 0 0 0
1 Jeff 0 1 1 1 0
2 Kyle 1 0 0 0 1
Upvotes: 0
Reputation: 2776
You want the pandas get_dummies() function (see the pandas documentation).
For your example:
names = df.Name
# sum(level=0) was removed in pandas 2.0; group by the original row index instead
df = pd.get_dummies(df.Hobbies.apply(pd.Series).stack()).groupby(level=0).sum()
df.insert(0, 'Name', names)
#output:
Name Play_PS4 Play_hockey Read Sleep Watch_NBA
0 Paul 1 0 0 0 1
1 Jeff 1 1 1 0 0
2 Kyle 0 0 0 1 1
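If you would rather skip the stack step, a shorter sketch is to join each list into one delimited string and let Series.str.get_dummies split it back out. This assumes the question's sample data and that no hobby name contains the "|" character (the default separator); the names dummies and result are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Paul", "Jeff", "Kyle"],
    "Hobbies": [
        ["Watch_NBA", "Play_PS4"],
        ["Play_hockey", "Read", "Play_PS4"],
        ["Sleep", "Watch_NBA"],
    ],
})

# Join each list into "a|b|c", then expand into one 0/1 column per hobby.
# Assumption: "|" never occurs inside a hobby name.
dummies = df["Hobbies"].str.join("|").str.get_dummies()

# Put Name back in front of the indicator columns (aligned on the row index).
result = df[["Name"]].join(dummies)
print(result)
```

The row order is unchanged and the hobby columns are sorted alphabetically, so new hobbies are picked up automatically without hardcoding anything.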
Upvotes: 1