Reputation: 147
I have time series data ranging from 2005-2014 with entries for each day of the year and the corresponding maximum temperature. I have created this series from a larger data set called data.
In[80]: data
Out[80]:
ID Date Element Data_Value
41334 USC00208080 2005-01-01 TMAX 33
55428 USC00207308 2005-01-01 TMIN -44
32266 USC00208202 2005-01-01 TMAX 150
2073 USC00203712 2005-01-01 TMAX 144
55424 USC00207308 2005-01-01 TMAX 150
In[79]: max_temp
Out[79]:
Date
2005-01-01 156
2005-01-02 139
2005-01-03 133
2005-01-04 39
2005-01-05 33
max_temp = data.groupby("Date")["Data_Value"].max()
As you can see this returns the max for each date for each year. What I want to do is find the maximum value for each day over the range 2005-2014
Desired result:
In[79]: max_temp
Out[79]:
Date
01-01 # max for Jan 1st between 2005-2014
01-02 # max for Jan 2nd between 2005-2014
01-03 # max for Jan 3rd between 2005-2014
01-04 # max for Jan 4th ...
01-05 # max for Jan 5th ...
Upvotes: 1
Views: 2275
Reputation: 3602
First of all I would suggest to split datetime to day, month and year.
data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
Then you can group by only by day and month like:
max_temp = data.groupby(['month','day'])["Data_Value"].max()
Upvotes: 4
Reputation: 148910
First, the direct answer to your question. You need the dt
accessor to get day or month from a datetime64 column:
max_temp = data.groupby([data["Date"].dt.month, data["Date"].dt.day])["Data_Value"].max()
That being said, on a meteorological point of view, using statistics on a single day of the year does not make sense, because the temperature (or other meteo measures) vary too much. Professionals use stats on full year, quarter, month or decade (1-10, 11-20, 21-end_of_month). Using a shorter period only add noise.
Upvotes: 0