Reputation: 1311
I have a multi-index dataframe with indexes for day and stock ticker. Here's a subset:
I want to create several lag variables. I've figured out how to create a one-day lag:
df['Number of Tweets [t-1]'] = df['Number of Tweets'].unstack().shift(1).stack()
Where I'm stuck is in creating a lag variable that aggregates (sums) values over days t-1 through t-3, or t-1 through t-7, or t-1 through t-30. For instance, I would like a column called 'Number of Tweets [sum of t-1 through t-3]'. I have played around with DateOffset and looked into resample but haven't been able to take it further than that. I also can't seem to find any answers in the Cookbook or examples in the documentation that would help. I'm spinning my wheels on this one so would appreciate any help.
Upvotes: 0
Views: 1134
Reputation: 109726
Use pd.rolling_sum on the shifted data. To calculate the rolling sum for t-3 through t-1, use a window length of 3 and shift the data by 1 (the default if no parameter is specified).
from pandas import Timestamp
# Create series
s = pd.Series({(Timestamp('2015-03-30 00:00:00'), 'AAPL'): 2,
(Timestamp('2015-03-30 00:00:00'), 'IBM'): 3,
(Timestamp('2015-03-30 00:00:00'), 'TWTR'): 2,
(Timestamp('2015-03-31 00:00:00'), 'AAPL'): 6,
(Timestamp('2015-03-31 00:00:00'), 'IBM'): 2,
(Timestamp('2015-03-31 00:00:00'), 'TWTR'): 7,
(Timestamp('2015-04-01 00:00:00'), 'AAPL'): 3,
(Timestamp('2015-04-01 00:00:00'), 'IBM'): 1,
(Timestamp('2015-04-01 00:00:00'), 'TWTR'): 2,
(Timestamp('2015-04-02 00:00:00'), 'AAPL'): 6,
(Timestamp('2015-04-02 00:00:00'), 'IBM'): 8,
(Timestamp('2015-04-02 00:00:00'), 'TWTR'): 2,
(Timestamp('2015-04-06 00:00:00'), 'AAPL'): 4,
(Timestamp('2015-04-06 00:00:00'), 'IBM'): 2,
(Timestamp('2015-04-06 00:00:00'), 'TWTR'): 6,
(Timestamp('2015-04-07 00:00:00'), 'AAPL'): 3,
(Timestamp('2015-04-07 00:00:00'), 'IBM'): 7,
(Timestamp('2015-04-07 00:00:00'), 'TWTR'): 8})
# View the data more easily:
s.unstack()
AAPL IBM TWTR
Date
2015-03-30 2 3 2
2015-03-31 6 2 7
2015-04-01 3 1 2
2015-04-02 6 8 2
2015-04-06 4 2 6
2015-04-07 3 7 8
# Calculate a rolling sum on date t for dates t-3 through t-1:
result = pd.rolling_sum(s.unstack().shift(), window=3) # .shift() <=> .shift(1)
>>> result
AAPL IBM TWTR
Date
2015-03-30 NaN NaN NaN
2015-03-31 NaN NaN NaN
2015-04-01 NaN NaN NaN
2015-04-02 11 6 11
2015-04-06 15 11 11
2015-04-07 13 11 10
# Restack the data:
>>> result.stack()
2015-04-02 AAPL 11
IBM 6
TWTR 11
2015-04-06 AAPL 15
IBM 11
TWTR 11
2015-04-07 AAPL 13
IBM 11
TWTR 10
Upvotes: 2