Reputation: 33
I have a dataframe of movies from pandas like this
id, name, genre, release_year
1 A [a,b,c] 2017
2 B [b,c] 2017
3 C [a,c] 2010
4 D [d,c] 2010
....
I want to group by the movies according to the values present in the genre lists. My expected output is:
year, genre, number_of_movies
2017 a 1
2017 b 2
2017 c 2
2010 a 1
2010 c 2
...
Can somebody please help me to achieve this?
Upvotes: 3
Views: 121
Reputation: 164713
Here is a collections.Counter
method, which has O(n) complexity, and removes the need for df.groupby
/ df.apply
:
from collections import Counter
from itertools import product, chain
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
'name': ['A', 'B', 'C', 'D'],
'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
'year': [2017, 2017, 2010, 2010]})
c = Counter(chain.from_iterable([list(product([x['year']], x['genre'])) \
for idx, x in df.iterrows()]))
# Counter({(2010, 'a'): 1,
# (2010, 'c'): 2,
# (2010, 'd'): 1,
# (2017, 'a'): 1,
# (2017, 'b'): 2,
# (2017, 'c'): 2})
df = pd.DataFrame.from_dict(c, orient='index')
# 0
# (2017, a) 1
# (2017, b) 2
# (2017, c) 2
# (2010, a) 1
# (2010, c) 2
# (2010, d) 1
Upvotes: 0
Reputation: 402643
For performance, use itertools.chain
to flatten the genre
column:
from itertools import chain
df = pd.DataFrame({
'genre' : list(
chain.from_iterable(df.genre.tolist())
),
'release_year' : df.release_year.repeat(df.genre.str.len())
})
df
genre release_year
0 a 2017
0 b 2017
0 c 2017
1 b 2017
1 c 2017
2 a 2010
2 c 2010
3 d 2010
3 c 2010
Now, group on genre
and release_year
and find the size
of each group:
df.groupby(
['genre', 'release_year'], sort=False
).size()\
.reset_index(name='number_of_movies')
genre release_year number_of_movies
0 a 2017 1
1 b 2017 2
2 c 2017 2
3 a 2010 1
4 c 2010 2
5 d 2010 1
Upvotes: 1
Reputation: 30605
Another cool method is to use Counter
i.e
from collections import Counter
ndf = df.groupby('release_year')['genre'].apply(lambda x : Counter(np.concatenate(x.values))).reset_index()
ndf = ndf.set_axis('release_year,genre,number_of_movies'.split(','),inplace=False,axis=1)
Output:
release_year genre number_of_movies
0 2010 a 1.0
1 2010 c 2.0
2 2010 d 1.0
3 2017 a 1.0
4 2017 b 2.0
5 2017 c 2.0
Upvotes: 1
Reputation: 862911
You can create new DataFrame
by contructor, reshape by stack
and for count use groupby
with size
:
df1 = (pd.DataFrame(df['genre'].values.tolist(), index=df['release_year'].values)
.stack()
.reset_index(name='genre')
.groupby(['release_year','genre'])
.size()
.reset_index(name='number_of_movies'))
print (df1)
release_year genre number_of_movies
0 2010 a 1
1 2010 c 2
2 2010 d 1
3 2017 a 1
4 2017 b 2
5 2017 c 2
Upvotes: 1