Gregory Saxton
Gregory Saxton

Reputation: 1311

Aggregating over a lagged time period in PANDAS

I have a multi-index dataframe with indexes for day and stock ticker. Here's a subset:

enter image description here

I want to create several lag variables. I've figured out how to create a one-day lag:

df['Number of Tweets [t-1]'] = df['Number of Tweets'].unstack().shift(1).stack()

Where I'm stuck is in creating a lag variable that aggregates (sums) values over days t-1 through t-3, or t-1 through t-7, or t-1 through t-30. For instance, I would like a column called 'Number of Tweets [sum of t-1 through t-3]'. I have played around with DateOffset and looked into resample but haven't been able to take it further than that. I also can't seem to find any answers in the Cookbook or examples in the documentation that would help. I'm spinning my wheels on this one so would appreciate any help.

Upvotes: 0

Views: 1134

Answers (1)

Alexander
Alexander

Reputation: 109726

Use pd.rolling_sum on the shifted data. To calculate the rolling sum for t-3 through t-1, use a window length of 3 and shift the data by 1 (the default if no parameter is specified).

from pandas import Timestamp

# Create series
s = pd.Series({(Timestamp('2015-03-30 00:00:00'), 'AAPL'): 2,
 (Timestamp('2015-03-30 00:00:00'), 'IBM'): 3,
 (Timestamp('2015-03-30 00:00:00'), 'TWTR'): 2,
 (Timestamp('2015-03-31 00:00:00'), 'AAPL'): 6,
 (Timestamp('2015-03-31 00:00:00'), 'IBM'): 2,
 (Timestamp('2015-03-31 00:00:00'), 'TWTR'): 7,
 (Timestamp('2015-04-01 00:00:00'), 'AAPL'): 3,
 (Timestamp('2015-04-01 00:00:00'), 'IBM'): 1,
 (Timestamp('2015-04-01 00:00:00'), 'TWTR'): 2,
 (Timestamp('2015-04-02 00:00:00'), 'AAPL'): 6,
 (Timestamp('2015-04-02 00:00:00'), 'IBM'): 8,
 (Timestamp('2015-04-02 00:00:00'), 'TWTR'): 2,
 (Timestamp('2015-04-06 00:00:00'), 'AAPL'): 4,
 (Timestamp('2015-04-06 00:00:00'), 'IBM'): 2,
 (Timestamp('2015-04-06 00:00:00'), 'TWTR'): 6,
 (Timestamp('2015-04-07 00:00:00'), 'AAPL'): 3,
 (Timestamp('2015-04-07 00:00:00'), 'IBM'): 7,
 (Timestamp('2015-04-07 00:00:00'), 'TWTR'): 8})

# View the data more easily:
s.unstack() 
            AAPL  IBM  TWTR
Date                       
2015-03-30     2    3     2
2015-03-31     6    2     7
2015-04-01     3    1     2
2015-04-02     6    8     2
2015-04-06     4    2     6
2015-04-07     3    7     8

# Calculate a rolling sum on date t for dates t-3 through t-1:
result = pd.rolling_sum(s.unstack().shift(), window=3)  # .shift() <=> .shift(1)

>>> result
            AAPL  IBM  TWTR
Date                       
2015-03-30   NaN  NaN   NaN
2015-03-31   NaN  NaN   NaN
2015-04-01   NaN  NaN   NaN
2015-04-02    11    6    11
2015-04-06    15   11    11
2015-04-07    13   11    10

# Restack the data:
>>> result.stack()
2015-04-02  AAPL    11
            IBM      6
            TWTR    11
2015-04-06  AAPL    15
            IBM     11
            TWTR    11
2015-04-07  AAPL    13
            IBM     11
            TWTR    10

Upvotes: 2

Related Questions