Reputation: 7022
Let's say my DataFrame df is created like this:
df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]},
                  columns=["title", "genres"])
and it looks like this:
        title                     genres
0  Robin Hood          Action, Adventure
1  Madagaskar  Family, Animation, Comedy
Let's assume each movie can have any number of genres. How can I expand the DataFrame into the following?
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
Upvotes: 6
Views: 4093
Reputation: 42916
Since pandas 0.25.0 there is a native method for this called explode.
It unnests each element of a list into its own row and repeats the values of the other columns.
Because genres holds strings, we first have to call Series.str.split to turn each string into a list of elements.
>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')
        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
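Note that explode keeps the original index, so the labels 0 and 1 repeat. A minimal sketch of one way to get the exact layout asked for in the question (a fresh 0..4 index and a column named genre), assuming the same two-row df; the name result is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]})

result = (
    df.assign(genre=df["genres"].str.split(", "))  # one list of genres per row
      .drop(columns="genres")                      # keep only title and genre
      .explode("genre")                            # one row per list element
      .reset_index(drop=True)                      # replace the repeated 0, 0, 1, 1, 1 labels
)
print(result)
```

If the original row labels are needed later (for example to join back to df), leave out the reset_index step.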
Upvotes: 1
Reputation: 210882
In [33]: (df.set_index('title')
    ...:    ['genres'].str.split(r',\s*', expand=True)
    ...:    .stack()
    ...:    .reset_index(name='genre')
    ...:    .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
PS: here you can find a more generic approach.
Upvotes: 8
Reputation: 863166
You can use np.repeat with numpy.concatenate for flattening.
splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                    'genres': np.concatenate(splitted.values)},
                   columns=['title', 'genres'])
print(df1)
        title     genres
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
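To see why the two arrays line up, here is a small sketch of the same idea on plain NumPy inputs, without pandas. The variable names (titles, genre_lists, counts) are made up for illustration: np.repeat copies each title once per genre, and np.concatenate joins the per-row genre lists in the same order.

```python
import numpy as np

titles = np.array(["Robin Hood", "Madagaskar"], dtype=object)
genre_lists = [["Action", "Adventure"], ["Family", "Animation", "Comedy"]]

# Number of genres per movie: [2, 3]
counts = [len(genres) for genres in genre_lists]

# Each title is repeated as many times as it has genres.
repeated_titles = np.repeat(titles, counts)

# The per-movie lists are joined end to end into one flat array.
flat_genres = np.concatenate(genre_lists)

for title, genre in zip(repeated_titles, flat_genres):
    print(title, genre)
```

Both results have length sum(counts), which is what lets them be used as two columns of the same DataFrame.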
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [95]: %%timeit
...: splitted = df['genres'].str.split(r',\s*')
...: l = splitted.str.len()
...:
...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
...: 'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
...:
...:
1 loop, best of 3: 709 ms per loop
In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
1 loop, best of 3: 750 ms per loop
Upvotes: 4