Reputation: 3
Consider this weather dataset (weather data for 50 years in 3hr intervals):
YEAR M D H WSP HS
1957-09-01 06:00:00 1957 9 1 6 8.9 0.9
1957-09-01 09:00:00 1957 9 1 9 6.4 0.6
1957-09-01 12:00:00 1957 9 1 12 5.2 0.4
1957-09-01 15:00:00 1957 9 1 15 1.9 0.3
1957-09-01 18:00:00 1957 9 1 18 3.7 0.3
1957-09-01 21:00:00 1957 9 1 21 3.9 0.4
1957-09-02 00:00:00 1957 9 2 0 5.6 0.4
1957-09-02 03:00:00 1957 9 2 3 5.3 0.5
I need to resample the dataset into 48hr periods and get the max HS value for that interval. Then this value needs to be groupby month to get the mean over the 50 year period for each month.
However, the resample can often include a range of values over two months:
1957-09-04 12:00:00 1957 9 4 12 7.8 1.1
1957-09-04 15:00:00 1957 9 4 15 6.7 1.1
1957-09-04 18:00:00 1957 9 4 18 9.4 1.1
1957-09-04 21:00:00 1957 9 4 21 9.7 1.2
1957-09-05 00:00:00 1957 9 5 0 8.9 1.2
1957-09-05 03:00:00 1957 9 5 3 9.0 1.1
I need the resample to be restricted to the month ONLY so the mean for the month over the entire dataset is not skewed by data from other months.
Upvotes: 0
Views: 257
Reputation: 4017
Note: I extended your data set a bit with some phoney data to better illustrate the groupings. A column name (DT) for the datetime was added as well. The dataset can be found here: weather.csv
df = pd.DataFrame()
df = pd.read_csv('./data/weather.csv')
df['date'] = pd.to_datetime(df['DT'], format='%Y-%m-%d %H:%M:%S')
# Grouping by 48 hours with a base of 0 (so the start of the grouping window begins midnight - and without a time)
# The max returns the highest number from the HS column.
df_48h = df.groupby(pd.Grouper(key='date', freq='48h', base=0))['HS'].max()
print(df_48h)
# Output:
# date
# 1957-09-01 0.6
# 1957-09-03 0.9
# ...
# 1957-10-01 0.6
# 1957-10-03 0.9
# Group all the 'high' values by month and get the mean
df_mm = df_48h.groupby(pd.Grouper(freq='M')).mean()
print(df_mm)
# Output:
# date
# 1957-09-30 0.75
# 1957-10-31 0.75
The Jupyter Notebook I played around in can be found here: pandas_48h_and_month_grouping.ipynb
Hope it helps, enjoy!
Upvotes: 0
Reputation: 3455
First take the data for one month and calculate the average of HS for 48 hours periods (starting at 00:00 hours on day 1 up to 24:00 hours on day 2, and so on [note the last period may be less than 48 hours], then repeat this for the next month and so on. To get the average per month take the average for all of the 48 hour periods per month (there will be either 14, 15 or 16 depending on the month). Having said that, would it not be much simpler to take the average of the data for one month, I am not sure why you want to have it first grouped by 48 hours?
Upvotes: 1