Sitabja
Sitabja

Reputation: 33

How to group by according to the values of a list present in a column in dataframe python

I have a dataframe of movies from pandas like this

id, name,     genre, release_year 
1    A    [a,b,c]     2017
2    B    [b,c]       2017
3    C    [a,c]       2010
4    D    [d,c]       2010
....

I want to group by the movies according to the values present in the genre lists. My expected output is:

year, genre, number_of_movies
2017  a       1
2017  b       2
2017  c       2
2010  a       1
2010  c       2 
...

Can somebody please help me to achieve this?

Upvotes: 3

Views: 121

Answers (4)

jpp
jpp

Reputation: 164713

Here is a collections.Counter method, which has O(n) complexity, and removes the need for df.groupby / df.apply:

from collections import Counter
from itertools import product, chain
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})

c = Counter(chain.from_iterable([list(product([x['year']], x['genre'])) \
                                 for idx, x in df.iterrows()]))

# Counter({(2010, 'a'): 1,
#          (2010, 'c'): 2,
#          (2010, 'd'): 1,
#          (2017, 'a'): 1,
#          (2017, 'b'): 2,
#          (2017, 'c'): 2})

df = pd.DataFrame.from_dict(c, orient='index')

#            0
# (2017, a)  1
# (2017, b)  2
# (2017, c)  2
# (2010, a)  1
# (2010, c)  2
# (2010, d)  1

Upvotes: 0

cs95
cs95

Reputation: 402643

For performance, use itertools.chain to flatten the genre column:

from itertools import chain

df = pd.DataFrame({
      'genre' : list(
           chain.from_iterable(df.genre.tolist())
       ), 
      'release_year' : df.release_year.repeat(df.genre.str.len())
})

df
  genre  release_year
0     a          2017
0     b          2017
0     c          2017
1     b          2017
1     c          2017
2     a          2010
2     c          2010
3     d          2010
3     c          2010

Now, group on genre and release_year and find the size of each group:

df.groupby(
     ['genre', 'release_year'], sort=False
 ).size()\
  .reset_index(name='number_of_movies')

  genre  release_year  number_of_movies
0     a          2017                 1
1     b          2017                 2
2     c          2017                 2
3     a          2010                 1
4     c          2010                 2
5     d          2010                 1

Upvotes: 1

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

Another cool method is to use Counter i.e

from collections import Counter

ndf = df.groupby('release_year')['genre'].apply(lambda x : Counter(np.concatenate(x.values))).reset_index()

ndf = ndf.set_axis('release_year,genre,number_of_movies'.split(','),inplace=False,axis=1)

Output:

   release_year genre  number_of_movies
0          2010     a               1.0
1          2010     c               2.0
2          2010     d               1.0
3          2017     a               1.0
4          2017     b               2.0
5          2017     c               2.0

Upvotes: 1

jezrael
jezrael

Reputation: 862911

You can create new DataFrame by contructor, reshape by stack and for count use groupby with size:

df1 = (pd.DataFrame(df['genre'].values.tolist(), index=df['release_year'].values)
         .stack()
         .reset_index(name='genre')
         .groupby(['release_year','genre'])
         .size()
         .reset_index(name='number_of_movies'))

print (df1)
   release_year genre  number_of_movies
0          2010     a                 1
1          2010     c                 2
2          2010     d                 1
3          2017     a                 1
4          2017     b                 2
5          2017     c                 2

Upvotes: 1

Related Questions