elPastor
elPastor

Reputation: 8996

Use Python/Pandas indexed date as condition in holiday list

I'm opening a CSV file with two columns and about 10,000 rows. The first column has a unique date and time stamp (ascending in 30-minute intervals, called 'date_time') and the second column has an integer, 'intnum'. I use the date_time column as my index and then use conditions to sum only the integers that fall into specific date ranges. All of the conditions work perfectly, EXCEPT the last condition is based on matching those dates with the USFederalHolidayCalendar.

Here's the rub, the indexed date is more complex (eg. '2015-02-16 12:30:00.00000') than the holiday list date (eg. '2015-02-16', President's Day). So when I run an 'isin' function against the holiday list, it doesn't find all of the integers associated with the whole day because '2015-02-16 12:30:00.00000' is not equal to '2015-02-16', despite the fact that it is the same day.

Code snippet:

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar, get_calendar 

newcal = get_calendar('USFederalHolidayCalendar')
holidays = newcal.holidays(start='2010-01-01', end='2016-12-31')

filename = "/Users/Me/Desktop/test.csv"

int_array = pd.read_csv(filename, header=0, parse_dates=['date_time'], index_col='date_time')

intnum_total = int(int_array['intnum'][(int_array.index.month >= 2) & 
   (int_array.index.month <= 3) & (int_array.index.hour >= 12) & 
   (int_array.index.isin(holidays) == TRUE)].sum()

print intnum_total

Now, I get no errors, so the syntax and functions work "properly", but I know for a fact the holiday match is not working.

Any thoughts?

Thanks ahead of time - this is my first post, so hopefully the formatting and question is clear.

Upvotes: 1

Views: 1679

Answers (2)

Primer
Primer

Reputation: 10302

Here are some some thoughts...

Say you have a list of holidays for 2016:

cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2016-01-01', end='2016-12-31')
print holidays.size

Which yields:

10

So there are 10 holidays in 2016 based on USFederalHolidayCalendar.

You also have your DateTimeIndex, which, let's say is covering 2015 and 2016:

idx = pd.DatetimeIndex(pd.date_range(start='2015-1-1', 
                                     end='2016-12-31', freq='30min'))
print idx.size

Which shows:

35041

Now if I would want to see how many holidays are in my 30 min based idx I would take the date part of the DateTimeIndex and compare it to date part of the holidays:

idx[pd.DatetimeIndex(idx.date).isin(holidays.date)].size

Which would give me:

480

Which is 10 holidays * 24 hours * 2 halfhours in an hour. Does that sound correct?

Note that when you do index.isin(other_index) you get back a boolean array which is sufficient for indexing, and you don't need to do an extra comparison index.isin(other_index) == True

Upvotes: 1

Alexander
Alexander

Reputation: 109726

Can't you just access the date from your timestamp and see if it is in your list of federal holidays? I don't know why you need your second integer index column; I would think a boolean value should suffice (e.g. fed_holiday).

df = pd.DataFrame(pd.date_range(start='2016-1-1', end='2016-12-31', freq='30min', name='ts'))

df['fed_holiday'] = [ts.date() in holidays for ts in df.ts]

>>> df.fed_holiday.sum() / (24 * 2.)
10.0

Upvotes: 1

Related Questions