rhozzy

Reputation: 352

Pandas Groupby Conditional Aggregation

Let's say you have a dataframe as follows:

data = pd.DataFrame({'Year': [2019]*5+[2020]*5,
          'Month': [1,1,2,2,3]*2,
          'Hour': [0,1,2,3,4]*2,
          'Value': [0.2,0.3,0.2,0.1,0.4,0.3,0.2,0.5,0.1,0.2]})

Then, set "low" times to be hours between 1 and 3 (inclusive), and "high" times to be all other hours (in this case, hours 0 and 4). What I would like to do is get the average Value for the "low" and "high" times for each Year and Month. Ideally, these would be appended as new columns to the groupby() dataframe (i.e., the final dataframe would have Year, Month, Low, and High columns).

For loops work, but they're not ideal. I could also create a dummy variable (e.g., 0s and 1s) marking the "low" and "high" times and group by it. However, it seems there should be some way to use pandas groupby(['Year', 'Month']).agg(...) to achieve the result efficiently. I haven't had any luck so far with groupby + agg, mainly because agg() operates on a single series (not the rest of the dataframe), so I can't apply a condition on Hour inside agg when averaging Value.
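For concreteness, the dummy-variable workaround I mentioned would look something like this (a sketch using the sample data above; the column name 'dummy' is my own placeholder):

```python
import pandas as pd

data = pd.DataFrame({'Year': [2019]*5 + [2020]*5,
                     'Month': [1, 1, 2, 2, 3]*2,
                     'Hour': [0, 1, 2, 3, 4]*2,
                     'Value': [0.2, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.5, 0.1, 0.2]})

# Label each row as "Low" (hours 1-3) or "High" (all other hours)
data['dummy'] = data['Hour'].between(1, 3).map({True: 'Low', False: 'High'})

# Group on the label as a third key, then spread it into columns
result = (data.groupby(['Year', 'Month', 'dummy'])['Value']
              .mean()
              .unstack('dummy')
              .reset_index()
              .rename_axis(None, axis='columns'))
print(result)
```

This produces the Year/Month/High/Low shape I'm after, but it still relies on mutating the dataframe with an extra column first.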

Expected result from sample data:

   Year  Month  High   Low
0  2019      1   0.2  0.30
1  2019      2   NaN  0.15
2  2019      3   0.4   NaN
3  2020      1   0.3  0.20
4  2020      2   NaN  0.30
5  2020      3   0.2   NaN

Any help is appreciated :)

Upvotes: 1

Views: 262

Answers (2)

Parfait

Reputation: 107767

Consider pivot_table after creating a low/high type indicator field:

import numpy as np

data['Type'] = np.where(data['Hour'].between(1, 3), 'Low', 'High')

pvt_df = (pd.pivot_table(data, index=['Year', 'Month'],
                         columns='Type', values='Value', aggfunc='mean')
            .reset_index()
            .rename_axis(None, axis='columns'))

print(pvt_df)
#    Year  Month  High   Low
# 0  2019      1   0.2  0.30
# 1  2019      2   NaN  0.15
# 2  2019      3   0.4   NaN
# 3  2020      1   0.3  0.20
# 4  2020      2   NaN  0.30
# 5  2020      3   0.2   NaN

Upvotes: 3

Erfan

Reputation: 42946

It might not win the prize for the most beautiful piece of code, but if I understand you correctly, this is what you want.

(Correct me if I'm wrong, since there's no expected output included.)

Group by four times and combine the year-level and month-level results; then do a final merge to get all the columns together.

low_hours = [1, 2, 3]
is_low = data.Hour.isin(low_hours)

groupby1 = data[is_low].groupby('Year').Value.mean().reset_index().rename({'Value': 'Value_year_low'}, axis=1)
groupby2 = data[~is_low].groupby('Year').Value.mean().reset_index().rename({'Value': 'Value_year_high'}, axis=1)
groupby3 = data[is_low].groupby(['Year', 'Month']).Value.mean().reset_index().rename({'Value': 'Value_month_low'}, axis=1)
groupby4 = data[~is_low].groupby(['Year', 'Month']).Value.mean().reset_index().rename({'Value': 'Value_month_high'}, axis=1)

# Merge on the group keys rather than concatenating by position, so months
# that appear in only one of the low/high subsets stay correctly aligned
df_final1 = pd.merge(groupby1, groupby2, on='Year')
df_final2 = pd.merge(groupby3, groupby4, on=['Year', 'Month'], how='outer')

df_final = pd.merge(df_final1, df_final2, on='Year')
print(df_final)
   Year  Value_year_low  Value_year_high  Month  Value_month_low  Value_month_high
0  2019        0.200000             0.30      1             0.30               0.2
1  2019        0.200000             0.30      2             0.15               NaN
2  2019        0.200000             0.30      3              NaN               0.4
3  2020        0.266667             0.25      1             0.20               0.3
4  2020        0.266667             0.25      2             0.30               NaN
5  2020        0.266667             0.25      3              NaN               0.2

Upvotes: 0
