yukashima huksay
yukashima huksay

Reputation: 6238

python - How to find the largest groups with pandas

I have a ratings dataframe that has rows of userId, movieId, rating. I want to find the user with the highest number of ratings.

Here is the code I've written:

import pandas as pd
ratings = pd.read_csv('ratings.csv') # userId,movieId,rating
user_rating_counts = ratings[['userId','movieId']].groupby('userId')['movieId'].agg(['count'])
top_rator = user_rating_counts[user_rating_counts['count']==user_rating_counts['count'].max()]

Here is how the file looks like:

userId,movieId,rating
1,1,4.0
1,3,4.0
1,6,4.0
1,47,5.0
1,50,5.0
1,70,3.0
1,101,5.0
1,110,4.0

When I view top_rator in jupyter notebook it looks like this:

       count
userId  
414     2698

What I want to get from it is a tuple like:

(414, 2698)

How can I do that?

P.S. any comments on how I can do this better/faster/shorter would be appreciated.

Upvotes: 4

Views: 3187

Answers (3)

Chris Adams
Chris Adams

Reputation: 18647

If there is only one user matching max, you can simply use:

next(top_rator.max(1).items())

Explanation

top_rator.max(1) will return:

userId
1    8
dtype: int64

Series.items() lazily iterates over the Series, creating the tuple of index and values in a zip generator object.

next()is used to access the 'next' (first) tuple in this generator


If there are multiple users matching the max, use list comprehension instead:

[(idx, val) for idx, val in top_rator.max(1).items()]

Upvotes: 2

jezrael
jezrael

Reputation: 863166

Use groupby with size and then Series.agg with max and idxmax in list:

tup = tuple(ratings.groupby('userId').size().agg(['idxmax','max']))
print (tup)
(1, 8)

Explanation:

First aggregate size per groups:

#changed data - multiple groups
print (df)
   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       2       47     5.0
4       2       50     5.0
5       2       70     3.0
6       2      101     5.0
7       3      110     4.0

print (df.groupby('userId').size())
userId
1    3
2    4
3    1
dtype: int64

Output is Series, so added Series.agg with list of functions idxmax and max for index and value of Series for maximum:

print (df.groupby('userId').size().agg(['idxmax','max']))
idxmax    2
max       4
dtype: int64

And last convert to tuple:

print (tuple(df.groupby('userId').size().agg(['idxmax','max'])))
(2, 4)

Solution if multiple groups with same maximum sizes:

print (ratings)   
   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       2       47     5.0
4       2       50     5.0
5       2       70     3.0
6       3      101     5.0
7       3      110     4.0

First aggregate size per groups, but there are 2 groups with max 3 value:

user_rating_counts = ratings.groupby('userId')['movieId'].size()
print (user_rating_counts)
userId
1    3
2    3
3    2
Name: movieId, dtype: int64

So use boolean indexing first:

top_rator = (user_rating_counts[user_rating_counts == user_rating_counts.max()])
print (top_rator)
userId
1    3
2    3
Name: movieId, dtype: int64

Create DataFrame and convert to list of tuples:

tup = list(map(tuple, top_rator.reset_index().values.tolist()))
print (tup)
[(1, 3), (2, 3)]

Upvotes: 4

yatu
yatu

Reputation: 88276

You can do:

sizes = df.groupby(['userId']).size()
(sizes.idxmax(), sizes.max())
#(1, 8)

Details:

Groupby userId and get the size of each group.

sizes = df.groupby(['userId']).size()
#userId
#1    8
#2    1

Create the tuple of the user with the highest number of ratings using idxmax and max:

(sizes.idxmax(), sizes.max())
#(1, 8)

Upvotes: 2

Related Questions