Reputation: 352
Let's say you have a dataframe as follows:
import pandas as pd

data = pd.DataFrame({'Year': [2019]*5 + [2020]*5,
                     'Month': [1, 1, 2, 2, 3]*2,
                     'Hour': [0, 1, 2, 3, 4]*2,
                     'Value': [0.2, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.5, 0.1, 0.2]})
Then, set "low" times to be hours between 1 and 3 (inclusive), and "high" times to be all other hours (in this case, hours 0 and 4). What I would like to do is get the average Value
for the "low" and "high" times for each Year
and Month
. Ideally, these would be appended as new columns to the groupby() dataframe (i.e., the final dataframe would have Year
, Month
, Low
, and High
columns).
For loops work, but they're not ideal. I could also create a dummy variable (for instance, 0s and 1s) to mark the "low" and "high" times in the dataframe and group on that. However, it seems to me that there should be some way to use Pandas groupby(['Year', 'Month']).agg(...) to achieve the result in an efficient/optimal way. I haven't had any luck so far with groupby + agg, mainly because agg() operates on a single Series at a time (not the rest of the dataframe), so one can't use a conditional on Hour inside agg to calculate the average Value.
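To be concrete, the dummy-variable route I have in mind looks roughly like this (just a sketch; the flag column name is arbitrary):
# Flag the "low" hours, group by Year/Month, and average the
# low and high rows of each group separately.
mask_low = data['Hour'].between(1, 3)
out = (data.assign(is_low=mask_low)
           .groupby(['Year', 'Month'])
           .apply(lambda g: pd.Series({
               'High': g.loc[~g['is_low'], 'Value'].mean(),   # NaN if no high rows
               'Low': g.loc[g['is_low'], 'Value'].mean()}))   # NaN if no low rows
           .reset_index())
It gives the expected result shown below, but it feels like a workaround rather than a clean groupby + agg.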
Expected result from sample data:
Year Month High Low
0 2019 1 0.2 0.30
1 2019 2 NaN 0.15
2 2019 3 0.4 NaN
3 2020 1 0.3 0.20
4 2020 2 NaN 0.30
5 2020 3 0.2 NaN
Any help is appreciated :)
Upvotes: 1
Views: 262
Reputation: 107767
Consider pivot_table after creating a low/high type indicator field:
import numpy as np

data['Type'] = np.where(data['Hour'].between(1, 3), 'Low', 'High')

pvt_df = (pd.pivot_table(data, index=['Year', 'Month'],
                         columns='Type', values='Value', aggfunc=np.mean)
            .reset_index()
            .rename_axis(None, axis='columns'))
print(pvt_df)
# Year Month High Low
# 0 2019 1 0.2 0.30
# 1 2019 2 NaN 0.15
# 2 2019 3 0.4 NaN
# 3 2020 1 0.3 0.20
# 4 2020 2 NaN 0.30
# 5 2020 3 0.2 NaN
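If you would rather stay with groupby, the same result can be reached by grouping on the Type indicator and unstacking it (a sketch of the same idea, not a different method):
alt = (data.groupby(['Year', 'Month', 'Type'])['Value'].mean()
           .unstack('Type')
           .reset_index()
           .rename_axis(None, axis='columns'))
print(alt)  # same Year / Month / High / Low layout as pvt_df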
Upvotes: 3
Reputation: 42946
Might not win the prize for the most beautiful piece of code, but if I understand you correctly, this is what you want (correct me if I'm wrong, since there's no expected output included).
Groupby 4 times and concat the year-level and month-level results together. After that, do a final merge to get all the columns together:
low_hours = [1, 2, 3]

groupby1 = (data[data.Hour.isin(low_hours)].groupby('Year').Value.mean()
            .reset_index().rename({'Value': 'Value_year_low'}, axis=1))
groupby2 = (data[~data.Hour.isin(low_hours)].groupby('Year').Value.mean()
            .reset_index().rename({'Value': 'Value_year_high'}, axis=1)
            .drop('Year', axis=1))
groupby3 = (data[data.Hour.isin(low_hours)].groupby(['Year', 'Month']).Value.mean()
            .reset_index().rename({'Value': 'Value_month_low'}, axis=1))
groupby4 = (data[~data.Hour.isin(low_hours)].groupby(['Year', 'Month']).Value.mean()
            .reset_index().rename({'Value': 'Value_month_high'}, axis=1)
            .drop(['Year', 'Month'], axis=1))

df_final1 = pd.concat([groupby1, groupby2], axis=1)
df_final2 = pd.concat([groupby3, groupby4], axis=1)
df_final = pd.merge(df_final1, df_final2, on='Year')
print(df_final)
Year Value_year_low Value_year_high Month Value_month_low \
0 2019 0.200000 0.30 1 0.30
1 2019 0.200000 0.30 2 0.15
2 2020 0.266667 0.25 1 0.20
3 2020 0.266667 0.25 2 0.30
Value_month_high
0 0.2
1 0.4
2 0.3
3 0.2
Upvotes: 0