Nckh

Reputation: 67

Get median of groups in pandas

I have a DataFrame df and want to do the following:

  1. Sort the data by one variable
  2. Group the data by the same variable
  3. Sum up the values of another column for every group
  4. Calculate the median of all group sums

What I tried is:

median_old = df.sort_values('user_id').groupby('user_id')['total_play_seconds'].sum().median()

Although I believe my output is correct, the online course won't let me proceed, stating that the median value is incorrect.

Where did I go wrong? As this is a task from an online course, I don't have a reproducible example, but I hope the problem is clear.

Upvotes: 0

Views: 1358

Answers (1)

David

Reputation: 8318

I'll base my answer on the example taken from:

https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df)
median = df.sort_values('Team').groupby("Team")["Points"].sum().to_frame()["Points"].median()
print(median)

As you can see, after the groupby and sum you get a pandas Series, not a DataFrame. So I believe all you need to do is add to_frame to turn the result back into a DataFrame, select the column, and then calculate the median in the same way you calculated the sum.
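
To make the types at each step visible, here is a small check on the same example data. It is only a sketch: the names sums, as_frame, median_via_frame and median_direct are mine and not part of the tutorial. It also computes the median directly on the Series, so you can compare the two results and verify whether the conversion changes anything in your case.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# groupby + column selection + sum returns a Series indexed by Team
sums = df.groupby('Team')['Points'].sum()
print(type(sums).__name__)

# to_frame converts that Series into a one-column DataFrame
as_frame = sums.to_frame()
print(type(as_frame).__name__)

# Median via the DataFrame column, and median taken on the Series itself
median_via_frame = as_frame['Points'].median()
median_direct = sums.median()
print(median_via_frame, median_direct)
```

Note that 'kings' and 'Kings' are treated as two different groups, because the grouping is case-sensitive.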

So in your case it should be:

median_old = df.sort_values('user_id').groupby('user_id')['total_play_seconds'].sum().to_frame()["total_play_seconds"].median()
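
Since there is no reproducible example, here is a minimal sketch of the four steps on made-up data. The column names come from your question, but the values, and the intermediate names sorted_df and per_user, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: column names from the question, values invented
df = pd.DataFrame({
    'user_id': [3, 1, 2, 1, 3, 2, 3],
    'total_play_seconds': [10, 20, 30, 40, 50, 60, 70],
})

# 1. Sort by the grouping variable
sorted_df = df.sort_values('user_id')

# 2. and 3. Group by the same variable and sum the other column per group
per_user = sorted_df.groupby('user_id')['total_play_seconds'].sum()
print(per_user)

# 4. Median of the per-group sums
median_old = per_user.to_frame()['total_play_seconds'].median()
print(median_old)
```

As far as I know, groupby sorts the group keys by default (sort=True), so the explicit sort_values should not change the sums or the median.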

Upvotes: 1
