Reputation: 6238
I have a ratings dataframe that has rows of userId, movieId, rating
. I want to find the user with the highest number of ratings.
Here is the code I've written:
import pandas as pd
ratings = pd.read_csv('ratings.csv') # userId,movieId,rating
user_rating_counts = ratings[['userId','movieId']].groupby('userId')['movieId'].agg(['count'])
top_rator = user_rating_counts[user_rating_counts['count']==user_rating_counts['count'].max()]
Here is how the file looks like:
userId,movieId,rating
1,1,4.0
1,3,4.0
1,6,4.0
1,47,5.0
1,50,5.0
1,70,3.0
1,101,5.0
1,110,4.0
When I view top_rator
in jupyter notebook it looks like this:
count
userId
414 2698
What I want to get from it is a tuple like:
(414, 2698)
How can I do that?
P.S. any comments on how I can do this better/faster/shorter would be appreciated.
Upvotes: 4
Views: 3187
Reputation: 18647
If there is only one user matching max, you can simply use:
next(top_rator.max(1).items())
top_rator.max(1)
will return:
userId
1 8
dtype: int64
Series.items()
lazily iterates over the Series, creating the tuple
of index and values in a zip
generator object.
next()
is used to access the 'next' (first) tuple
in this generator
If there are multiple users matching the max, use list comprehension instead:
[(idx, val) for idx, val in top_rator.max(1).items()]
Upvotes: 2
Reputation: 863166
Use groupby
with size
and then Series.agg
with max
and idxmax
in list:
tup = tuple(ratings.groupby('userId').size().agg(['idxmax','max']))
print (tup)
(1, 8)
Explanation:
First aggregate size
per groups:
#changed data - multiple groups
print (df)
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 2 47 5.0
4 2 50 5.0
5 2 70 3.0
6 2 101 5.0
7 3 110 4.0
print (df.groupby('userId').size())
userId
1 3
2 4
3 1
dtype: int64
Output is Series
, so added Series.agg
with list of functions idxmax
and max
for index and value of Series for maximum:
print (df.groupby('userId').size().agg(['idxmax','max']))
idxmax 2
max 4
dtype: int64
And last convert to tuple
:
print (tuple(df.groupby('userId').size().agg(['idxmax','max'])))
(2, 4)
Solution if multiple groups with same maximum sizes:
print (ratings)
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 2 47 5.0
4 2 50 5.0
5 2 70 3.0
6 3 101 5.0
7 3 110 4.0
First aggregate size
per groups, but there are 2 groups with max 3
value:
user_rating_counts = ratings.groupby('userId')['movieId'].size()
print (user_rating_counts)
userId
1 3
2 3
3 2
Name: movieId, dtype: int64
So use boolean indexing
first:
top_rator = (user_rating_counts[user_rating_counts == user_rating_counts.max()])
print (top_rator)
userId
1 3
2 3
Name: movieId, dtype: int64
Create DataFrame
and convert to list of tuples:
tup = list(map(tuple, top_rator.reset_index().values.tolist()))
print (tup)
[(1, 3), (2, 3)]
Upvotes: 4
Reputation: 88276
You can do:
sizes = df.groupby(['userId']).size()
(sizes.idxmax(), sizes.max())
#(1, 8)
Details:
Groupby userId
and get the size
of each group.
sizes = df.groupby(['userId']).size()
#userId
#1 8
#2 1
Create the tuple of the user with the highest number of ratings using idxmax
and max
:
(sizes.idxmax(), sizes.max())
#(1, 8)
Upvotes: 2