Reputation: 221
I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]
but when I reshape the array I notice that there are 366 columns instead of 365 so I don't think I'm actually getting rid of february 29th data. How would I completely eliminate any temperature data that is recorded on 2/29 throughout my data set. I only want 365 data points for each year.
daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')
Upvotes: 0
Views: 892
Reputation: 31011
The source of your problem is that you used daily.date.dt.dayofyear.
Each day in a year, including Feb 29 has its own number.
To make thing worse, e.g. Mar 1 has dayofyear:
One of possible solutions is to set the day column to a string representation of month and day. To provide proper sort in the pivoted table, the month part should come first.
So, after you convert date column to datetime, to create both additional columns run:
daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')
Then you can filter out Feb 29 and generate the pivot table in one go:
result = daily[daily.day != '02-29'].pivot(index='year', columns='day',
values='tmax')
For some limited source data sample, other than yours, I got:
day 02-27 02-28 03-01 03-02
year
2020 11 10 14 15
2021 11 21 22 24
Create 3 additional columns:
daily['year'] = daily.date.dt.year
daily['month'] = daily.date.dt.strftime('%m')
daily['day'] = daily.date.dt.strftime('%d')
Note the string representation of month and day, to keep leading zeroes.
Then filter out Feb 29 and generate the pivot table with a MulitiIndex on columns:
result = daily[(daily.month != '02') | (daily.day != '29')].pivot(
index='year', columns=['month', 'day'], values='tmax')
This time the result is:
month 02 03
day 27 28 01 02
year
2020 11 10 14 15
2021 11 21 22 24
Upvotes: 0
Reputation: 23156
You get 366 columns because of using dayofyear
. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).
To see this:
>>> daily.iloc[1154:1157]
date tmax day year
1154 1904-02-28 38.000000 59 1904
1156 1904-03-01 39.000000 61 1904
1157 1904-03-02 37.000000 62 1904
Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).
Upvotes: 0
Reputation: 54733
The easy way is to eliminate those items before building the array.
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)
Upvotes: 0