Brian
Brian

Reputation: 97

Using Pandas groupby methods, find largest values in each group

By using Pandas groupby, I have data on how much activity certain users have on average any given day of the week. Grouped by user and day, I compute max and mean for several users in the last 30 days.

Now I want to find, for every user, which day of the week corresponds to their daily max activity, and what is the average magnitude of that activity.

What is the method in pandas to perform such a task?

The original data looks something like this:

    userID  countActivity   weekday
0   3       25              5
1   3       58              6
2   3       778             0
3   3       78208           1
4   3       6672            2

The object that has these groups is created from the following:

aggregations = {
    'countActivity': { 
        'maxDaily': 'max',
        'meanDaily': 'mean'
    }
}

dailyAggs = df.groupby(['userID','weekday']).agg(aggregations)

The groupby object looks something like this:

                    countActivity       
                maxDaily    meanDaily
userID  weekday     
3       0       84066       18275.6
        1       78208       20698.5
        2       172579      64930.75
        3       89535       25443
        4       6152        2809

Pandas groupby method filter seems to be needed here, but I'm stumped how on how to proceed.

Upvotes: 1

Views: 1007

Answers (1)

root
root

Reputation: 33793

I'd first do a groupby on 'userID', and then write an apply function to do the rest. The apply function will take a 'userID' group, perform another groupby on 'weekday' to do your aggregations, and then only return the row that contains the maximum value for maxDaily, which can be found with argmax.

def get_max_daily(grp):
    aggregations = {'countActivity': {'maxDaily': 'max', 'meanDaily': 'mean'}}
    grp = grp.groupby('weekday').agg(aggregations).reset_index()
    return grp.loc[grp[('countActivity', 'maxDaily')].argmax()]

result = df.groupby('userID').apply(get_max_daily)

I've added a row to your sample data to make sure the daily aggregations were working correctly, since your sample data only contains one entry per weekday:

   userID  countActivity  weekday
0       3             25        5
1       3             58        6
2       3            778        0
3       3          78208        1
4       3           6672        2
5       3          78210        1

The resulting output:

       weekday countActivity         
                   meanDaily maxDaily
userID                               
3            1         78209    78210

Upvotes: 5

Related Questions