Reputation: 327
I'm using sklearn to make a forecast. Usually I create lagged data for forecasting, which is easy if the features consist of only one column.
For example:
index  temperature
1      100
2      80
3      50
4      90
5      110
Usually I create additional columns by shifting the temperature values to build the lagged data, using something like:
dataframe['temperature2'] = dataframe.temperature.shift(1)
dataframe['temperature3'] = dataframe.temperature.shift(2)
So my dataframe becomes:
index  temperature  temperature2  temperature3
1      100          NaN           NaN
2      80           100           NaN
3      50           80            100
4      90           50            80
5      110          90            50
Then when I want to forecast, I can fit (after dropping the NaN rows) with code like:
x = dataframe.loc[:, ('temperature3', 'temperature2')]
y = dataframe.temperature
model.fit(x, y)
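Put together, the single-column version is something like this (LinearRegression is just a stand-in for whatever sklearn model is used):
import pandas as pd
from sklearn.linear_model import LinearRegression  # stand-in for any sklearn model

dataframe = pd.DataFrame({'temperature': [100, 80, 50, 90, 110]})
dataframe['temperature2'] = dataframe.temperature.shift(1)
dataframe['temperature3'] = dataframe.temperature.shift(2)

train = dataframe.dropna()  # the first rows have NaN lags
model = LinearRegression()
model.fit(train[['temperature3', 'temperature2']], train.temperature)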
The problem is that when I have a lot of columns to create lagged data from, and need more lags for each column, my dataframe gets too big.
Is there any simple approach I can use? Thanks!
For reference, this is my dataframe:
import pandas as pd
import talib
df = pd.read_csv('..\\data\\uj5.CSV', names=['date','time','open','high','low','close','volume'])
df.index = pd.to_datetime(df.date + df.time,format='%Y.%m.%d%H:%M')
# assuming these are the 'X'; let's say I need 100 lags of each of these columns to predict the 'Y'
df['CDLBELTHOLD'] = talib.CDLBELTHOLD(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLCLOSINGMARUBOZU'] = talib.CDLCLOSINGMARUBOZU(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLDOJI'] = talib.CDLDOJI(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLHIKKAKE'] = talib.CDLHIKKAKE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLLONGLEGGEDDOJI'] = talib.CDLLONGLEGGEDDOJI(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLLONGLINE'] = talib.CDLLONGLINE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLSHORTLINE'] = talib.CDLSHORTLINE(df.open.values, df.high.values, df.low.values, df.close.values)
df['CDLSPINNINGTOP'] = talib.CDLSPINNINGTOP(df.open.values, df.high.values, df.low.values, df.close.values)
df['atr'] = talib.ATR(df.high.values, df.low.values, df.close.values,timeperiod=14)
# assuming this is the Y
df['target'] = ...  # some int value
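To make the scale concrete, the shift approach for 100 lags of each of those columns would look something like this (a sketch of the pattern, not what I want to end up with):
feature_cols = ['CDLBELTHOLD', 'CDLCLOSINGMARUBOZU', 'CDLDOJI', 'CDLHIKKAKE',
                'CDLLONGLEGGEDDOJI', 'CDLLONGLINE', 'CDLSHORTLINE',
                'CDLSPINNINGTOP', 'atr']
lagged = pd.concat(
    {f'{col}_lag{k}': df[col].shift(k)
     for col in feature_cols for k in range(1, 101)},
    axis=1,
)
# 9 columns x 100 lags = 900 extra columns -- this is what gets too big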
Upvotes: 1
Views: 1758
Reputation: 33938
So don't create multiple columns with lag-n data.
Only compute the lagged data dynamically, when you need it. For example, it might even be possible to frame the whole computation, with all its lags, as multiplication by some matrix kernel. We can't know until we see your code. Show us the specific code or formula you're trying to compute if you want a more specific answer.
EDIT: to the downvoters, this is actually the more correct answer. If you clog up your dataframe with all sorts of intermediate and temporary columns of stuff you could compute dynamically as needed, you will waste memory. That's the wrong way to go. Same general principle as lazy evaluation in Spark.
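A minimal sketch of the on-demand idea, assuming a 1-D numpy array of values (lag_matrix is just an illustrative name, not anything from sklearn):
import numpy as np

def lag_matrix(series, n_lags):
    # build the lag matrix only at fit time, instead of storing
    # n_lags extra columns in the dataframe
    x = np.asarray(series)
    # row t holds [x[t-1], x[t-2], ..., x[t-n_lags]] for the target x[t]
    return np.column_stack([x[n_lags - k:len(x) - k] for k in range(1, n_lags + 1)])

# computed only when fitting, then discarded:
# X = lag_matrix(df['atr'], 100)
# y = df['atr'].values[100:]
# model.fit(X, y)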
Upvotes: -1
Reputation: 495
I wouldn't use pandas here if you end up with a large matrix. NumPy is more appropriate:
import numpy as np

x = your_data_frame.values.ravel()  # assumes a single series of values
max_lag = 10  # number of lags you want to have
m = np.array([])
for i in np.arange(len(x) - max_lag):
    new_row = x[i:i + max_lag][None, :]  # one window of max_lag consecutive values
    if len(m) == 0:
        m = new_row
    else:
        m = np.vstack([m, new_row])
Then you can fit your data using
model.fit(m[:, 1:], m[:, 0])
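Note that np.vstack inside a loop copies the whole matrix on every iteration. If your NumPy is recent enough (1.20+), sliding_window_view builds the same windows in one shot (a sketch under the same single-series assumption; it yields one more window than the loop above):
from numpy.lib.stride_tricks import sliding_window_view

m = sliding_window_view(x, max_lag)  # shape: (len(x) - max_lag + 1, max_lag)
model.fit(m[:, 1:], m[:, 0])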
Upvotes: 3