Reputation: 1282
I have a time-series DataFrame and I want to replicate each of my 200 features/columns as additional lagged features. At the moment I have features at time t, and I want to create features at timesteps t-1, t-2 and so on.
I know this is best done with df.shift(), but I'm having trouble putting it all together. I also want to rename the columns to 'feature (t-1)', 'feature (t-2)', etc.
My pseudo-code attempt would be something like:
lagged_values = [1,2,3,10]
for every lagged_values
    for every column, make a new feature column with df.shift(lagged_values)
        make new column have name 'original col name'+'(t-(lagged_values))'
In the end, if I have 200 columns and 4 lagged timesteps, I would have a new df with 1,000 features (200 each at t, t-1, t-2, t-3 and t-10).
I have found something similar from Machine Learning Mastery, but it doesn't keep the original column names (it renames them to var1, var2, etc.). Unfortunately, I don't understand it well enough to adapt it to my problem.
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
Upvotes: 12
Views: 11297
Reputation: 4873
For those using the statsmodels package, there's a function called lagmat that does exactly this:
from statsmodels.tsa.tsatools import lagmat
df = lagmat(df, maxlag=3, use_pandas=True)
I don't think you can control the column names though; you'll get ColName.L.1, etc.
Upvotes: 2
Reputation: 1
It's significantly faster to map across the lags and shift all columns at once, rather than iterating through each lag/column pair.
from functools import partial

import numpy as np
import pandas as pd

def addSimpleLags(df, lag_list, col_list, direction='lag'):
    # A negative shift looks forward in time, which turns lags into leads.
    if direction == 'lead':
        lag_list = map(lambda x: x*(-1), lag_list)
    # Shift all requested columns at once, with one call per lag.
    arr_lags = list(map(partial(_buildLags, df=df,
                                col_list=col_list,
                                direction=direction),
                        lag_list))
    df = pd.concat([df] + arr_lags, axis=1)
    return df

def _buildLags(lag, df, col_list, direction):
    return df[col_list].shift(lag).add_suffix('_{}_{}'.format(np.abs(lag), direction))
When I tried this out, it was about 5x faster than the accepted solution (for 5 lags x 261 columns x 3M rows).
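To make the difference between the two strategies concrete, here is a minimal sketch that builds the same lagged frame both ways and checks that the results match. The function names lags_per_column and lags_per_frame and the random demo data are made up for illustration; the speed-up you see will depend on your data.

```python
import numpy as np
import pandas as pd

def lags_per_column(df, lags):
    # One shift per (lag, column) pair: len(lags) * len(df.columns) calls.
    out = df.copy()
    for lag in lags:
        for col in df.columns:
            out[f"{col}_{lag}_lag"] = df[col].shift(lag)
    return out

def lags_per_frame(df, lags):
    # One shift per lag: every column moves together, then a single concat.
    shifted = [df.shift(lag).add_suffix(f"_{lag}_lag") for lag in lags]
    return pd.concat([df] + shifted, axis=1)

# Small random frame, purely for demonstration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((6, 3)), columns=list("ABC"))

slow = lags_per_column(demo, [1, 2])
fast = lags_per_frame(demo, [1, 2])

# Raises an AssertionError if the two frames differ in names or values.
pd.testing.assert_frame_equal(slow, fast)
print(fast.columns.tolist())
```

Both versions produce the same columns in the same order, so the whole-frame version can be swapped in without changing anything downstream.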
Upvotes: 0
Reputation: 76
If you have a large data frame and rely on a substantial number of lagged values, you may want to apply a more efficient solution using pd.concat and pd.DataFrame.add_suffix:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
lags = range(0, 3)
df = pd.concat([df.shift(t).add_suffix(f" (t-{t})") for t in lags], axis=1) # add lags
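The same idea can be adapted to the lags from the question, [1, 2, 3, 10], while keeping the original columns under their original names. This is only a sketch: the helper name add_lagged_features and the random two-column frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_lagged_features(df, lags):
    """Return df plus one shifted copy of every column per lag (hypothetical helper)."""
    # Shift the whole frame once per lag and tag the copies with ' (t-k)'.
    lagged = [df.shift(lag).add_suffix(f" (t-{lag})") for lag in lags]
    # The original columns come first and keep their names.
    return pd.concat([df] + lagged, axis=1)

# Twelve rows so that the t-10 lag still has some non-missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((12, 2)), columns=["A", "B"])

wide = add_lagged_features(df, [1, 2, 3, 10])
print(wide.columns.tolist())
print(wide.shape)
```

With 2 columns and 4 lags this gives 2 x (1 + 4) = 10 columns. By the same arithmetic, 200 columns would give the 1,000 features the question expects.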
Upvotes: 1
Reputation: 109626
You can create the additional columns using a dictionary comprehension and then add them to your dataframe via assign.
df = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
lags = range(1, 3)  # Just two lags for demonstration.
>>> df.assign(**{
    f'{col} (t-{lag})': df[col].shift(lag)
    for lag in lags
    for col in df
})
          A         B   A (t-1)   A (t-2)   B (t-1)   B (t-2)
0 -0.773571  1.945746       NaN       NaN       NaN       NaN
1  1.375648  0.058043 -0.773571       NaN  1.945746       NaN
2  0.727642  1.802386  1.375648 -0.773571  0.058043  1.945746
3 -2.427135 -0.780636  0.727642  1.375648  1.802386  0.058043
4  1.542809 -0.620816 -2.427135  0.727642 -0.780636  1.802386
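Shifting leaves NaN in the first rows of every lagged column, which is what the dropnan option in the question's function deals with. Here is a minimal sketch of the same clean-up after assign, assuming you want to keep only rows where every lag is available (the variable names and random data are for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), columns=list("AB"))
lags = range(1, 3)

# Same dictionary comprehension as above: one new column per (lag, column) pair.
lagged = df.assign(**{
    f"{col} (t-{lag})": df[col].shift(lag)
    for lag in lags
    for col in df
})

# The first max(lags) rows are missing at least one lag, so drop them.
complete = lagged.dropna()
print(complete.shape)
```

With five rows and lags 1 and 2, three complete rows remain. Note that dropna() also removes rows that were already missing values in the original data; if that is not what you want, lagged.iloc[max(lags):] trims only the leading rows.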
Upvotes: 31