Phyo Arkar Lwin
Phyo Arkar Lwin

Reputation: 6901

How can i do calculations on subsets , Panda's way, without looping

I have days like this:

eventday_idxs
2005-01-07 00:00:00
2005-01-31 00:00:00
2005-02-15 00:00:00
2005-04-18 00:00:00
2005-05-11 00:00:00
2005-08-12 00:00:00
2005-08-15 00:00:00
2005-09-06 00:00:00
2005-09-19 00:00:00
2005-10-12 00:00:00
2005-10-13 00:00:00
2005-10-20 00:00:00
2006-01-10 00:00:00
2006-01-30 00:00:00
2006-02-10 00:00:00
2006-03-29 00:00:00

I want to do calculations From : To ranges of it like this on AAPL stock dataset.
As i am beginner in Pandas i use loop and do like this.

aap1_10_years = pd.io.data.get_data_yahoo('AAPL', 
                                 start=datetime.datetime(2004, 12, 10), 
                                 end=datetime.datetime(2014, 12, 10))
one_day = timedelta(days=1)
for i,ind in enumerate(eventday_idxs):
    try:
        do_calculations(aapl_10_years[ ind: eventday_idxs[i+1] - one_day ]['High'])
    except IndexError:
        do_calculations(aapl_10_years[ ind:]['High'] )

How can i apply do_calcuations without loops like this? Because loops like this are discouraged in panda because are slow, right?

Upvotes: 1

Views: 67

Answers (1)

unutbu
unutbu

Reputation: 880509

The time spans between the events are not regular:

In [141]: eventday_idxs.diff().head()
Out[141]: 
0       NaT
1   24 days
2   15 days
3   62 days
4   23 days
Name: 0, dtype: timedelta64[ns]

so we can't express the calculation using rolling_apply. However, if we could assign an "event number" to each of the rows in aap1_10_years, then we could groupby these event numbers and apply do_calculations to each group.

If we define:

# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()

then aap1_10_years['event'] equals 1 for these rows:

In [144]: aap1_10_years.loc[aap1_10_years['event'] == 1, ['Close', 'event']]
Out[144]: 
            Close  event
Date                    
2005-01-07  69.25      1
2005-01-10  68.96      1
2005-01-11  64.56      1
2005-01-12  65.46      1
2005-01-13  69.80      1
2005-01-14  70.20      1
2005-01-18  70.65      1
2005-01-19  69.88      1
2005-01-20  70.46      1
2005-01-21  70.49      1
2005-01-24  70.76      1
2005-01-25  72.05      1
2005-01-26  72.25      1
2005-01-27  72.64      1
2005-01-28  73.98      1

Thus event number 1 has been assigned to all the rows with dates between 2005-01-07 and 2005-01-28. And similarly, each of the other event ranges have been assigned a unique event number.


import datetime as DT
import pandas as pd
import pandas.io.data as pdata
eventday_idxs = pd.to_datetime(pd.read_table('data', header=None)[0])
aap1_10_years = pdata.get_data_yahoo(
    'AAPL', 
    start=DT.datetime(2004, 12, 10), 
    end=DT.datetime(2014, 12, 10))

# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()

mask = aap1_10_years['event'] > 0
aap1_10_years.loc[mask].groupby(['event'])['High'].apply(do_calculations)

Upvotes: 3

Related Questions