Reputation: 343
I'm trying to find how much a stock will change from a given day to n days in the future. The only problem is that it takes about a minute to run this on 1000 lines of data and I have millions of lines. I think the 'lag' is caused by the line:
stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
I'm thinking that the whole 3d data frame of 500 stocks might be being copied every time this line is hit or something, but I just don't know for certain, or know how to make it faster. Also, it's throwing this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Here's my code:
daysForeward = 2
for days in range(1,daysForeward+1):
string = 'closeShift'+str(days)
stocks[0][i][string] = stocks[0][i]['adjClose'].shift(days-(days*2))
for line in stocks[0][i].itertuples():
num = 6 #first closeShift columnb
for days in range(1,daysForeward+1):
string = 'closeShift'+str(days)
stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
num+=1
Here's the data before and after applying the percent change:
date open close adjClose closeShift1 closeShift2
0 19980102 20.3835 20.4417 NaN NaN 0.984507
1 19980105 20.5097 20.5679 NaN 0.984507 1.034904
2 19980106 20.1408 20.0826 0.984507 1.034904 0.994047
3 19980107 20.1408 20.9950 1.034904 0.994047 0.982926
4 19980108 21.1115 20.0244 0.994047 0.982926 0.989441
date open close adjClose closeShift1 closeShift2
0 19980102 20.3835 20.4417 NaN NaN NaN
1 19980105 20.5097 20.5679 NaN NaN NaN
2 19980106 20.1408 20.0826 0.984507 4.869735 0.959720
3 19980107 20.1408 20.9950 1.034904 -3.947904 -5.022423
4 19980108 21.1115 20.0244 0.994047 -1.118683 -0.463311
Some explanations:
The [0]
in stocks[0][i]
is just to get to the proper level in the 3d data frame, the [i]
is for the stock name in the stocks that are being iterated through in a higher up for loop.
The adjClose
column is just a modified version of close
which is what I prefer using instead of close
.
adjPctChange()
is a custom percent change function that switches the equation around so that 100 to 50 will product the same result as 50 to 100 so the results can be averaged and won't skew upwards.
def adjPctChange(startPoint, currentPoint):
if startPoint < currentPoint:
x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
else:
x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0
return x
Thanks to anyone who can help!
Upvotes: 2
Views: 248
Reputation: 37928
You shouldn't loop over a DataFrame; just do everything with array functions.
Before:
In [30]: df
Out[30]:
date open close adjClose closeShift1 closeShift2
0 19980102 20.3835 20.4417 NaN NaN 0.984507
1 19980105 20.5097 20.5679 NaN 0.984507 1.034904
2 19980106 20.1408 20.0826 0.984507 1.034904 0.994047
3 19980107 20.1408 20.9950 1.034904 0.994047 0.982926
4 19980108 21.1115 20.0244 0.994047 0.982926 0.989441
Array notation:
daysForeward = 2
for day in range(1, daysForeward+1):
column = 'closeShift' + str(day)
df[column] = (df[column] - df.adjClose) / np.maximum(df[column], df.adjClose) * 100.0
After:
In [33]: df
Out[33]:
date open close adjClose closeShift1 closeShift2
0 19980102 20.3835 20.4417 NaN NaN NaN
1 19980105 20.5097 20.5679 NaN NaN NaN
2 19980106 20.1408 20.0826 0.984507 4.869727 0.959713
3 19980107 20.1408 20.9950 1.034904 -3.947902 -5.022495
4 19980108 21.1115 20.0244 0.994047 -1.118760 -0.463358
Upvotes: 2
Reputation: 294506
IIUC:
I started off with this dataframe:
print df
date open close adjclose
0 19980102 20.3835 20.4417 0.984507
1 19980105 20.5097 20.5679 1.034904
2 19980106 20.1408 20.0826 0.994047
3 19980107 20.1408 20.9950 0.982926
4 19980108 21.1115 20.0244 0.989441
Then I created these functions:
def get_lags(s, n):
return pd.concat([s.shift(i) for i in range(n + 1)],
axis=1, keys=range(n + 1))
def get_comps(lags):
comps = []
for i, cni in enumerate(lags.columns):
if i > 0:
max_ = lags.iloc[:, [0, i]].max(1)
min_ = lags.iloc[:, [0, i]].min(1)
comps.append((max_ / min_ - 1) * 100)
return pd.concat(comps, axis=1)
Then I get lags and compare them:
print get_comps(get_lags(df.adjclose, 2))
0 1
0 0.000000 0.000000
1 5.119009 0.000000
2 4.110168 0.969013
3 1.131418 5.288089
4 0.662817 0.465515
Finally, I concatenate them with df
print pd.concat([df, get_comps(get_lags(df.adjclose, 2))], axis=1)
date open close adjclose 0 1
0 19980102 20.3835 20.4417 0.984507 0.000000 0.000000
1 19980105 20.5097 20.5679 1.034904 5.119009 0.000000
2 19980106 20.1408 20.0826 0.994047 4.110168 0.969013
3 19980107 20.1408 20.9950 0.982926 1.131418 5.288089
4 19980108 21.1115 20.0244 0.989441 0.662817 0.465515
Modify as you need.
Upvotes: 0