Reputation: 73
I have daily rainfall data that looks like the following:
Date Rainfall (mm)
1922-01-01 0.0
1922-01-02 0.0
1922-01-03 0.0
1922-01-04 0.0
1922-01-05 31.5
1922-01-06 0.0
1922-01-07 0.0
1922-01-08 0.0
1922-01-09 0.0
1922-01-10 0.0
1922-01-11 0.0
1922-01-12 9.1
1922-01-13 6.4
I am trying to work out the maximum value for each month for each year, and also what date the maximum value occurred on. I have been using the code:
rain_data.groupby(pd.Grouper(freq = 'M'))['Rainfall (mm)'].max()
This is returning the correct maximum value but returns the end date of each month rather than the date that maximum event occurred on.
1974-11-30 0.0
I have also tried using .idxmax(), but this also just return the end values of each month.
Any suggestions on how I could get the correct date?
Upvotes: 1
Views: 56
Reputation: 59549
pd.Grouper
seems to change the order within groups for Datetime
, which breaks the usual trick of .sort_values
+ .tail
. Instead group on the year and month:
df.sort_values('Rainfall (mm)').groupby([df.Date.dt.year, df.Date.dt.month]).tail(1)
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Date': pd.date_range('1922-01-01', freq='D', periods=100),
'Rainfall (mm)': np.random.randint(1,100,100)})
df.sort_values('Rainfall (mm)').groupby([df.Date.dt.month, df.Date.dt.year]).tail(1)
# Date Rainfall (mm)
#82 1922-03-24 92
#35 1922-02-05 98
#2 1922-01-03 99
#90 1922-04-01 99
The problem with pd.Grouper
is that it creates a DatetimeIndex
with an end of the month frequency, which we don't really need and we're using .apply
. This give you a new index, and is nicely sorted by date though!
(df.groupby(pd.Grouper(key='Date', freq='1M'))
.apply(lambda x: x.loc[x['Rainfall (mm)'].idxmax()])
.reset_index(drop=True))
# Date Rainfall (mm)
#0 1922-01-03 99
#1 1922-02-05 98
#2 1922-03-24 92
#3 1922-04-01 99
Also can with .drop_duplicates
using the first 7 characters of the date to get the Year-Month
(df.assign(ym = df.Date.astype(str).str[0:7])
.sort_values('Rainfall (mm)')
.drop_duplicates('ym', keep='last')
.drop(columns='ym'))
Upvotes: 1