F. Suyuti

Reputation: 327

Lagging data on Pandas dataframes

I'm using sklearn to make a forecast. Usually I create lagged data for forecasting, which is easy if the feature set consists of only one column.

for example:

index  temperature
1      100
2      80
3      50
4      90
5      110

Usually I create another column by shifting the temperature values to produce the lagged data, using a function like:

dataframe.temperature.shift()

So my dataframe becomes:

index  temperature temperature2 temperature3
1      100         NaN          NaN
2      80          100          NaN
3      50          80           100
4      90          50           80
5      110         90           50
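For reference, a minimal sketch of how the table above can be built with `shift()` (using a toy frame, not my real data):

```python
import pandas as pd

# rebuild the example frame, then add the lag columns with shift()
df = pd.DataFrame({'temperature': [100, 80, 50, 90, 110]},
                  index=[1, 2, 3, 4, 5])
df['temperature2'] = df.temperature.shift(1)  # lag 1
df['temperature3'] = df.temperature.shift(2)  # lag 2
```
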

Then when I want to forecast, I can fit with code like:

x = dataframe.loc[:,('temperature3','temperature2')]
y = dataframe.temperature
model.fit(x, y)

The problem is that when I have a lot of columns to create lag data for, and need more lagged values for each column, my dataframe becomes too big.

Is there any simple approach that I can use? Thanks!

For reference, this is my dataframe:

import pandas as pd
import talib

df = pd.read_csv('..\\data\\uj5.CSV', names=['date','time','open','high','low','close','volume'])
df.index = pd.to_datetime(df.date + df.time,format='%Y.%m.%d%H:%M')


# assuming these are the 'X'; let's say I need 100 lags of these columns to predict the 'Y'
df['CDLBELTHOLD'] = talib.CDLBELTHOLD(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLCLOSINGMARUBOZU'] = talib.CDLCLOSINGMARUBOZU(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLDOJI'] = talib.CDLDOJI(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLHIKKAKE'] = talib.CDLHIKKAKE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLLONGLEGGEDDOJI'] = talib.CDLLONGLEGGEDDOJI(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLLONGLINE'] = talib.CDLLONGLINE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLSHORTLINE'] = talib.CDLSHORTLINE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLSPINNINGTOP'] = talib.CDLSPINNINGTOP(df.open.values, df.high.values, df.low.values, df.close.values)
df['atr'] = talib.ATR(df.high.values, df.low.values, df.close.values,timeperiod=14)


#assuming this is the Y
df['target'] = #some int value

Upvotes: 1

Views: 1758

Answers (2)

smci

Reputation: 33938

So don't create multiple columns with lag-n data.

Only compute the lagged data dynamically, when you need it. For example, it might even be possible to frame the computation, with all its lags, as multiplication by some matrix kernel; we can't tell until we see your code. Show us the specific code or formula you're trying to compute if you want a more specific answer.

EDIT: to the downvoters, this is actually the more correct answer. If you clog up your dataframe with all sorts of intermediate and temporary columns for things you could compute dynamically as needed, you will waste memory. That's the wrong way to go. It's the same general principle as lazy evaluation in Spark.
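One way to read this advice (a sketch with a hypothetical helper, not code from the answer itself): instead of materialising 100 lag columns in the dataframe, generate each lagged design matrix on demand and feed it to a model that supports incremental fitting, e.g. via sklearn's `partial_fit`:

```python
import numpy as np

def lagged_batches(series, n_lags, batch_size=1000):
    """Yield (X, y) batches of lagged data built on the fly,
    so the full lag matrix never has to live in the dataframe."""
    values = np.asarray(series, dtype=float)
    for start in range(n_lags, len(values), batch_size):
        stop = min(start + batch_size, len(values))
        # each row of X holds the n_lags values preceding its y
        X = np.stack([values[i - n_lags:i] for i in range(start, stop)])
        y = values[start:stop]
        yield X, y
```

Each batch can then go to `model.partial_fit(X, y)`, and peak memory stays bounded by `batch_size * n_lags` instead of `len(series) * n_lags`.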

Upvotes: -1

mm441

Reputation: 495

I wouldn't use pandas here if you end up with a large matrix. NumPy is more appropriate:

import numpy as np

x = your_data_frame.values
max_lag = 10  # number of lags you want to have

# collect one row per window of max_lag consecutive values, then
# stack them into a matrix in a single step (calling np.vstack
# inside the loop would copy the whole matrix on every iteration)
rows = [x[i:i + max_lag] for i in range(len(x) - max_lag)]
m = np.array(rows)

then you can fit your data using

model.fit(m[:, 1:], m[:, 0])
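On newer NumPy (≥ 1.20, an assumption about your environment), the same window matrix can be produced without any Python loop via `sliding_window_view`, which returns a view into the original array rather than a copy (shown here on a toy series):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.array([100, 80, 50, 90, 110, 70], dtype=float)  # toy series
max_lag = 3

# m[i] is the window x[i:i+max_lag]; no data is copied
m = sliding_window_view(x, max_lag)
X, y = m[:, :-1], m[:, -1]  # predict each value from the lags before it
```
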

Upvotes: 3
