How to optimize changing a value in a Pandas Data Frame column

Question

I'm trying to compute how much a stock's price changes from a given day to n days in the future. The problem is that it takes about a minute to run this on 1000 rows of data, and I have millions of rows. I think the 'lag' is caused by the line:

stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])

I suspect the whole 3D data frame of 500 stocks is being copied every time this line runs, but I don't know for certain, and I don't know how to make it faster. It also throws this warning:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

Here's my code:

daysForeward = 2
for days in range(1,daysForeward+1):
    string = 'closeShift'+str(days)
    stocks[0][i][string] = stocks[0][i]['adjClose'].shift(-days)  # adjClose from `days` rows ahead

for line in stocks[0][i].itertuples():
    num = 6  # first closeShift column
    for days in range(1,daysForeward+1):
        string = 'closeShift'+str(days)
        stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
        num+=1

Here's the data before and after applying the percent change:

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869735     0.959720
3  19980107  20.1408  20.9950  1.034904    -3.947904    -5.022423
4  19980108  21.1115  20.0244  0.994047    -1.118683    -0.463311

Some explanations:

The [0] in stocks[0][i] just selects the proper level in the 3D data frame. The [i] is the stock name, which comes from an outer for loop that iterates over the stocks.

The adjClose column is a modified version of close, which I prefer to use instead of close.

adjPctChange() is a custom percent change function that always divides by the larger of the two values, so that 100 to 50 produces the same magnitude as 50 to 100 (with the opposite sign). That way the results can be averaged without skewing upwards.

def adjPctChange(startPoint, currentPoint):
    if startPoint < currentPoint:
        x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
    else:
        x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0    
    return x

Thanks to anyone who can help!

chrisaycock · Accepted Answer

You shouldn't loop over the rows of a DataFrame; do everything with array functions that operate on whole columns at once.

Before:

In [30]: df
Out[30]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

Array notation:

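import numpy as np

daysForeward = 2
for day in range(1, daysForeward+1):
    column = 'closeShift' + str(day)
    # (future - current) divided by the larger of the two, for the whole column at once
    df[column] = (df[column] - df.adjClose) / np.maximum(df[column], df.adjClose) * 100.0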

After:

In [33]: df
Out[33]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869727     0.959713
3  19980107  20.1408  20.9950  1.034904    -3.947902    -5.022495
4  19980108  21.1115  20.0244  0.994047    -1.118760    -0.463358
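
For reference, here is a minimal self-contained sketch that combines the shift and the percent change in one pass and checks it against the scalar function from the question. The snake_case names (`adj_pct_change`, `days_forward`, `future`) are illustrative, and the frame holds only the five adjClose values from the sample, so the last rows come out as NaN because no future value exists for them:

```python
import numpy as np
import pandas as pd

def adj_pct_change(start, current):
    """Scalar version of the question's adjPctChange, kept only for comparison."""
    if start < current:
        return abs((float(start) - current) / float(current)) * 100.0
    return (float(current) - start) / float(start) * 100.0

# adjClose values copied from the sample rows in the question
df = pd.DataFrame({"adjClose": [np.nan, np.nan, 0.984507, 1.034904, 0.994047]})

days_forward = 2
for day in range(1, days_forward + 1):
    # Value `day` rows ahead; the last `day` rows become NaN
    future = df["adjClose"].shift(-day)
    # Divide by the larger of the two values, as adjPctChange does
    df["closeShift" + str(day)] = (
        (future - df["adjClose"]) / np.maximum(future, df["adjClose"]) * 100.0
    )
```

Each pass through the loop touches whole columns, so the cost grows with the number of shift columns rather than with the number of rows visited in Python.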

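As for the SettingWithCopyWarning in the question: it typically comes from chained indexing such as stocks[0][i][string][line[index]] = ..., where pandas cannot tell whether the intermediate object is a view or a copy, so the write may not reach the original frame. If a single-cell write is ever still needed, a sketch along these lines avoids the chain. The small frame is illustrative only (values taken from the sample tables, default integer index assumed) and is not the asker's actual data structure:

```python
import pandas as pd

# Illustrative frame with a default integer index
df = pd.DataFrame({
    "adjClose": [0.984507, 1.034904, 0.994047],
    "closeShift1": [1.034904, 0.994047, 0.982926],
})

# One indexing call that names row and column together writes into df itself
df.loc[0, "closeShift1"] = 4.869735

# .at is the scalar-only counterpart for a single cell
df.at[1, "closeShift1"] = -3.947904
```

Even so, per-cell writes inside a Python loop remain slow on millions of rows; the column-wise approach is the one that addresses the speed problem.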