Megan Martin
Megan Martin

Reputation: 221

How to eliminate leap years in pandas data frame

I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:

import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly",  "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]

but when I reshape the array I notice that there are 366 columns instead of 365 so I don't think I'm actually getting rid of february 29th data. How would I completely eliminate any temperature data that is recorded on 2/29 throughout my data set. I only want 365 data points for each year.

daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')

Upvotes: 0

Views: 892

Answers (3)

Valdi_Bo
Valdi_Bo

Reputation: 31011

The source of your problem is that you used daily.date.dt.dayofyear.

Each day in a year, including Feb 29 has its own number.

To make thing worse, e.g. Mar 1 has dayofyear:

  • 61 in leap years,
  • 60 in non-leap years.

One of possible solutions is to set the day column to a string representation of month and day. To provide proper sort in the pivoted table, the month part should come first.

So, after you convert date column to datetime, to create both additional columns run:

daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')

Then you can filter out Feb 29 and generate the pivot table in one go:

result = daily[daily.day != '02-29'].pivot(index='year', columns='day',
    values='tmax')

For some limited source data sample, other than yours, I got:

day   02-27  02-28  03-01  03-02
year                            
2020     11     10     14     15
2021     11     21     22     24

An alternative

Create 3 additional columns:

daily['year']  = daily.date.dt.year
daily['month'] = daily.date.dt.strftime('%m')
daily['day']   = daily.date.dt.strftime('%d')

Note the string representation of month and day, to keep leading zeroes.

Then filter out Feb 29 and generate the pivot table with a MulitiIndex on columns:

result = daily[(daily.month != '02') | (daily.day != '29')].pivot(
    index='year', columns=['month', 'day'], values='tmax')

This time the result is:

month  02      03    
day    27  28  01  02
year                 
2020   11  10  14  15
2021   11  21  22  24

Upvotes: 0

not_speshal
not_speshal

Reputation: 23156

You get 366 columns because of using dayofyear. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).

To see this:

>>> daily.iloc[1154:1157]

           date       tmax  day  year
1154 1904-02-28  38.000000   59  1904
1156 1904-03-01  39.000000   61  1904
1157 1904-03-02  37.000000   62  1904

Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).

Upvotes: 0

Tim Roberts
Tim Roberts

Reputation: 54733

The easy way is to eliminate those items before building the array.

import requests
from datetime import date

params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly",  "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)

Upvotes: 0

Related Questions