Reputation: 4338
I'd like to produce a summary dataframe after grouping by date. I want one column that shows the mean of a given column as it is, and another that shows the mean of that same column after filtering for values greater than 0. I figured out how to do this (below), but it requires two separate groupby calls, renaming the columns, and then joining the results back together. I feel like one should be able to do this all in one call. I tried to use eval for this but kept getting an error telling me to use apply, and that I couldn't use eval on a groupby object.
Code which gets me what I want but doesn't seem very efficient:
import pandas as pd
# Sample data
data = pd.DataFrame(
{"year" : [2013, 2013, 2013, 2014, 2014, 2014],
"month" : [1, 2, 3, 1, 2, 3],
"day": [1, 1, 1, 1, 1, 1],
"delay": [0, -4, 50, -60, 9, 10]})
subset = (data
.groupby(['year', 'month', 'day'])['delay']
.mean()
.reset_index()
.rename(columns = {'delay' : 'avg_delay'})
)
subset_1 = (data[data.delay > 0]
.groupby(['year', 'month', 'day'])['delay']
.mean()
.reset_index()
.rename(columns = {'delay' : 'avg_delay_pos'})
)
combined = pd.merge(subset, subset_1, how='left', on=['year', 'month', 'day'])
combined
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
Upvotes: 2
Views: 688
Reputation: 71580
IIUC, you could use the following code:
>>> data['avg_delay'] = data.pop('delay')
>>> data['avg_delay_pos'] = data.loc[data['avg_delay'].gt(0), 'avg_delay']
>>> data
day month year avg_delay avg_delay_pos
0 1 1 2013 0 NaN
1 1 2 2013 -4 NaN
2 1 3 2013 50 50.0
3 1 1 2014 -60 NaN
4 1 2 2014 9 9.0
5 1 3 2014 10 10.0
>>>
Explanation:
I first pop the delay column and assign it back under the new name avg_delay, which effectively renames delay to avg_delay.
Then I create a new column called avg_delay_pos. loc selects only the values greater than zero, and because the selection keeps its original index, the assignment aligns on that index: rows with a positive value receive the value of avg_delay, and all other rows receive nothing, so they are NaN as you expected.
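To make the index alignment concrete, here is a minimal sketch on the delay values from the question. The variable name positive is only for illustration.

```python
import pandas as pd

data = pd.DataFrame({"delay": [0, -4, 50, -60, 9, 10]})

# Boolean selection keeps the original index labels of the matching rows.
positive = data.loc[data["delay"].gt(0), "delay"]
print(positive.index.tolist())  # [2, 4, 5]

# Column assignment aligns on index labels; rows without a match become NaN.
data["delay_pos"] = positive
print(data)
```

Note that this is a row-wise operation with no aggregation. It matches the grouped result for the sample data only because every date appears exactly once there.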
Upvotes: 1
Reputation: 402533
The solution is specific to your problem, but you can do this using a single groupby call. To get "avg_delay_pos", you just have to remove negative (and zero) values.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
(df.filter(like='delay')
.groupby(pd.to_datetime(df[['year', 'month', 'day']]))
.mean()
.add_prefix('avg_'))
avg_delay avg_delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Breakdown
where is used to mask values that are not positive:
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
# df['delay'].where(df['delay'] > 0)
0 NaN
1 NaN
2 50.0
3 NaN
4 9.0
5 10.0
Name: delay, dtype: float64
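The masking works for this problem because mean skips NaN by default, so the group mean of the masked column is the mean of the positive values only. A small sketch with made-up data (the date keys "a" and "b" are hypothetical, chosen so one group has several rows):

```python
import pandas as pd

# Hypothetical data: three rows share the key "a", one row has the key "b".
df = pd.DataFrame({"date": ["a", "a", "a", "b"],
                   "delay": [-5, 10, 30, -1]})

# Non-positive delays become NaN in the new column.
df["delay_pos"] = df["delay"].where(df["delay"] > 0)

# mean() ignores NaN, so for "a" delay_pos averages only 10 and 30 -> 20.0.
# For "b" there is no positive value, so the group mean is NaN.
result = df.groupby("date")[["delay", "delay_pos"]].mean()
print(result)
```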
Next, extract the delay columns we want to aggregate:
df.filter(like='delay')
delay delay_pos
0 0 NaN
1 -4 NaN
2 50 50.0
3 -60 NaN
4 9 9.0
5 10 10.0
Then perform a groupby on the date:
_.groupby(pd.to_datetime(df[['year', 'month', 'day']])).mean()
delay delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Here pd.to_datetime is used to convert the year/month/day columns into a single datetime column; grouping on a single column is more efficient than grouping on several.
pd.to_datetime(df[['year', 'month', 'day']])
0 2013-01-01
1 2013-02-01
2 2013-03-01
3 2014-01-01
4 2014-02-01
5 2014-03-01
dtype: datetime64[ns]
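As a rough sketch of why this Series can be passed straight to groupby: it shares the frame's index, so each row is assigned to the group of its own date. The two-row frame below is made up for illustration; the columns need to be named year, month and day for pd.to_datetime to assemble them.

```python
import pandas as pd

# Hypothetical two-row frame with the column names to_datetime expects.
df = pd.DataFrame({"year": [2013, 2014], "month": [1, 3],
                   "day": [1, 1], "delay": [0, 10]})

# Assemble one datetime per row from the year/month/day columns.
dates = pd.to_datetime(df[["year", "month", "day"]])

# The Series has the same index as df, so it can serve as the group key.
means = df["delay"].groupby(dates).mean()
print(means)
```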
The final .add_prefix('avg_') adds the prefix "avg_" to each column name in the result.
An alternative way to do this, if you want separate year/month/day columns, would be:
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
df.groupby(['year', 'month', 'day']).mean().add_prefix('avg_').reset_index()
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
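If you would rather not add a column to the original frame or rename afterwards, one possible variant is named aggregation. This is only a sketch and assumes pandas 0.25 or later, where agg accepts output_name=(column, function) pairs; it uses the data frame from the question.

```python
import pandas as pd

data = pd.DataFrame({
    "year": [2013, 2013, 2013, 2014, 2014, 2014],
    "month": [1, 2, 3, 1, 2, 3],
    "day": [1, 1, 1, 1, 1, 1],
    "delay": [0, -4, 50, -60, 9, 10],
})

combined = (
    data
    # Temporary masked column; the original frame is left untouched.
    .assign(delay_pos=lambda d: d["delay"].where(d["delay"] > 0))
    .groupby(["year", "month", "day"], as_index=False)
    # Named aggregation sets the output column names directly.
    .agg(avg_delay=("delay", "mean"), avg_delay_pos=("delay_pos", "mean"))
)
print(combined)
```

This keeps everything in a single chain with one groupby call, at the cost of requiring a newer pandas than the add_prefix version.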
Upvotes: 0