Reputation: 6901
I have days like this:
eventday_idxs
2005-01-07 00:00:00
2005-01-31 00:00:00
2005-02-15 00:00:00
2005-04-18 00:00:00
2005-05-11 00:00:00
2005-08-12 00:00:00
2005-08-15 00:00:00
2005-09-06 00:00:00
2005-09-19 00:00:00
2005-10-12 00:00:00
2005-10-13 00:00:00
2005-10-20 00:00:00
2006-01-10 00:00:00
2006-01-30 00:00:00
2006-02-10 00:00:00
2006-03-29 00:00:00
I want to do calculations From : To
ranges of it like this on AAPL stock dataset.
As i am beginner in Pandas i use loop and do like this.
aap1_10_years = pd.io.data.get_data_yahoo('AAPL',
start=datetime.datetime(2004, 12, 10),
end=datetime.datetime(2014, 12, 10))
one_day = timedelta(days=1)
for i,ind in enumerate(eventday_idxs):
try:
do_calculations(aapl_10_years[ ind: eventday_idxs[i+1] - one_day ]['High'])
except IndexError:
do_calculations(aapl_10_years[ ind:]['High'] )
How can i apply do_calcuations
without loops like this? Because loops like this are discouraged in panda because are slow, right?
Upvotes: 1
Views: 67
Reputation: 880509
The time spans between the events are not regular:
In [141]: eventday_idxs.diff().head()
Out[141]:
0 NaT
1 24 days
2 15 days
3 62 days
4 23 days
Name: 0, dtype: timedelta64[ns]
so we can't express the calculation using rolling_apply. However, if we could
assign an "event number" to each of the rows in aap1_10_years
, then we could
groupby these event numbers and apply do_calculations
to each group.
If we define:
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
then aap1_10_years['event']
equals 1 for these rows:
In [144]: aap1_10_years.loc[aap1_10_years['event'] == 1, ['Close', 'event']]
Out[144]:
Close event
Date
2005-01-07 69.25 1
2005-01-10 68.96 1
2005-01-11 64.56 1
2005-01-12 65.46 1
2005-01-13 69.80 1
2005-01-14 70.20 1
2005-01-18 70.65 1
2005-01-19 69.88 1
2005-01-20 70.46 1
2005-01-21 70.49 1
2005-01-24 70.76 1
2005-01-25 72.05 1
2005-01-26 72.25 1
2005-01-27 72.64 1
2005-01-28 73.98 1
Thus event number 1 has been assigned to all the rows with dates between 2005-01-07 and 2005-01-28. And similarly, each of the other event ranges have been assigned a unique event number.
import datetime as DT
import pandas as pd
import pandas.io.data as pdata
eventday_idxs = pd.to_datetime(pd.read_table('data', header=None)[0])
aap1_10_years = pdata.get_data_yahoo(
'AAPL',
start=DT.datetime(2004, 12, 10),
end=DT.datetime(2014, 12, 10))
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
mask = aap1_10_years['event'] > 0
aap1_10_years.loc[mask].groupby(['event'])['High'].apply(do_calculations)
Upvotes: 3