Ben G

Reputation: 4338

Perform multiple operations in a single groupby call with pandas?

I'd like to produce a summary dataframe after grouping by date. I want one column showing the mean of a given column as-is, and another showing the mean of that same column after filtering for values greater than 0. I figured out how to do this (below), but it requires two separate groupby calls, renaming the columns, and then joining them back together. I feel like this should be possible in a single call. I tried using eval, but kept getting an error saying that eval can't be used on a groupby object and that I should use apply instead.

Code which gets me what I want but doesn't seem very efficient:

# Sample data

data = pd.DataFrame(
          {"year" : [2013, 2013, 2013, 2014, 2014, 2014],
           "month" : [1, 2, 3, 1, 2, 3],
           "day": [1, 1, 1, 1, 1, 1],
           "delay": [0, -4, 50, -60, 9, 10]})

subset = (data
          .groupby(['year', 'month', 'day'])['delay']
          .mean()
          .reset_index()
          .rename(columns = {'delay' : 'avg_delay'})
         )

subset_1 = (data[data.delay > 0]
          .groupby(['year', 'month', 'day'])['delay']
          .mean()
          .reset_index()
          .rename(columns = {'delay' : 'avg_delay_pos'})
         )

combined = pd.merge(subset, subset_1, how='left', on=['year', 'month', 'day'])
combined

   year  month  day  avg_delay  avg_delay_pos
0  2013      1    1          0            NaN
1  2013      2    1         -4            NaN
2  2013      3    1         50           50.0
3  2014      1    1        -60            NaN
4  2014      2    1          9            9.0
5  2014      3    1         10           10.0
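For reference, one way to sketch this as a single groupby call is pandas named aggregation (the `agg(name=(column, func))` form, available since pandas 0.25); the lambda for the positive-only mean is an assumption of this sketch, not code from the question:

```python
import pandas as pd

data = pd.DataFrame(
    {"year": [2013, 2013, 2013, 2014, 2014, 2014],
     "month": [1, 2, 3, 1, 2, 3],
     "day": [1, 1, 1, 1, 1, 1],
     "delay": [0, -4, 50, -60, 9, 10]})

# Named aggregation: each output column is a (source column, function) pair,
# so both means come from one groupby call. The mean of an empty selection
# (no positive delays in the group) is NaN, matching the merge-based output.
combined = (data
            .groupby(['year', 'month', 'day'])
            .agg(avg_delay=('delay', 'mean'),
                 avg_delay_pos=('delay', lambda s: s[s > 0].mean()))
            .reset_index())
```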

Upvotes: 2

Views: 688

Answers (2)

U13-Forward

Reputation: 71580

IIUC, you could use the following code:

>>> data['avg_delay'] = data.pop('delay')
>>> data['avg_delay_pos'] = data.loc[data['avg_delay'].gt(0), 'avg_delay']
>>> data
   day  month  year  avg_delay  avg_delay_pos
0    1      1  2013          0            NaN
1    1      2  2013         -4            NaN
2    1      3  2013         50           50.0
3    1      1  2014        -60            NaN
4    1      2  2014          9            9.0
5    1      3  2014         10           10.0
>>> 

Explanation:

  • I first remove the delay column with pop, which returns it, and assign it back under the name avg_delay — effectively renaming delay to avg_delay.

  • Then I create a new column called avg_delay_pos. loc selects only the values greater than zero, and since their index labels are preserved, the assignment aligns on those labels: rows with positive values receive the avg_delay value, and the remaining rows receive nothing, so they end up NaN, as you expected.
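The index alignment this relies on can be seen by looking at the intermediate filtered Series (a small sketch using the question's sample data):

```python
import pandas as pd

data = pd.DataFrame(
    {"year": [2013, 2013, 2013, 2014, 2014, 2014],
     "month": [1, 2, 3, 1, 2, 3],
     "day": [1, 1, 1, 1, 1, 1],
     "delay": [0, -4, 50, -60, 9, 10]})

data['avg_delay'] = data.pop('delay')

# The loc selection keeps only rows 2, 4, 5 — but with their ORIGINAL
# index labels, which is what makes the assignment below align correctly.
positive = data.loc[data['avg_delay'].gt(0), 'avg_delay']

# Assigning a Series to a column aligns on index labels; rows 0, 1, 3
# are absent from `positive`, so they become NaN.
data['avg_delay_pos'] = positive
```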

Upvotes: 1

cs95

Reputation: 402533

This solution is specific to your problem, but it does work with a single groupby call. To get "avg_delay_pos", you just have to remove negative (and zero) values first.

df['delay_pos'] = df['delay'].where(df['delay'] > 0)

(df.filter(like='delay')
   .groupby(pd.to_datetime(df[['year', 'month', 'day']]))
   .mean()
   .add_prefix('avg_'))

            avg_delay  avg_delay_pos
2013-01-01          0            NaN
2013-02-01         -4            NaN
2013-03-01         50           50.0
2014-01-01        -60            NaN
2014-02-01          9            9.0
2014-03-01         10           10.0

Breakdown

where is used to mask values that are not positive.

df['delay_pos'] = df['delay'].where(df['delay'] > 0)
# df['delay'].where(df['delay'] > 0)

0     NaN
1     NaN
2    50.0
3     NaN
4     9.0
5    10.0
Name: delay, dtype: float64

Next, extract the delay columns we want to aggregate,

df.filter(like='delay')

   delay  delay_pos
0      0        NaN
1     -4        NaN
2     50       50.0
3    -60        NaN
4      9        9.0
5     10       10.0

Then perform a groupby on the date,

_.groupby(pd.to_datetime(df[['year', 'month', 'day']])).mean()

            delay  delay_pos
2013-01-01      0        NaN
2013-02-01     -4        NaN
2013-03-01     50       50.0
2014-01-01    -60        NaN
2014-02-01      9        9.0
2014-03-01     10       10.0

Here, pd.to_datetime converts the year/month/day columns into a single datetime column; it's more efficient to group on a single column than on multiple columns.

pd.to_datetime(df[['year', 'month', 'day']])

0   2013-01-01
1   2013-02-01
2   2013-03-01
3   2014-01-01
4   2014-02-01
5   2014-03-01
dtype: datetime64[ns]

The final .add_prefix('avg_') adds the prefix "avg_" to the column names of the result.


An alternative way to do this if you want separate year/month/day columns would be

df['delay_pos'] = df['delay'].where(df['delay'] > 0)
df.groupby(['year', 'month', 'day']).mean().add_prefix('avg_').reset_index()

   year  month  day  avg_delay  avg_delay_pos
0  2013      1    1          0            NaN
1  2013      2    1         -4            NaN
2  2013      3    1         50           50.0
3  2014      1    1        -60            NaN
4  2014      2    1          9            9.0
5  2014      3    1         10           10.0
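If mutating df in place is undesirable, the same alternative can be sketched with assign (the assign step is an addition of this edit, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2013, 2013, 2013, 2014, 2014, 2014],
     "month": [1, 2, 3, 1, 2, 3],
     "day": [1, 1, 1, 1, 1, 1],
     "delay": [0, -4, 50, -60, 9, 10]})

# assign adds the masked column on a copy, leaving df untouched;
# everything else matches the answer's groupby/mean/add_prefix chain.
out = (df.assign(delay_pos=df['delay'].where(df['delay'] > 0))
         .groupby(['year', 'month', 'day'])
         .mean()
         .add_prefix('avg_')
         .reset_index())
```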

Upvotes: 0
