puzzler
puzzler

Reputation: 343

How to optimize changing a value in a Pandas Data Frame column

I'm trying to find how much a stock will change from a given day to n days in the future. The only problem is that it takes about a minute to run this on 1000 lines of data and I have millions of lines. I think the 'lag' is caused by the line:

stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])

I'm thinking that the whole 3d data frame of 500 stocks might be being copied every time this line is hit or something, but I just don't know for certain, or know how to make it faster. Also, it's throwing this warning:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

Here's my code:

daysForeward = 2
for days in range(1,daysForeward+1):
    string = 'closeShift'+str(days)
    stocks[0][i][string] = stocks[0][i]['adjClose'].shift(days-(days*2))

for line in stocks[0][i].itertuples():
    num = 6 #first closeShift columnb
    for days in range(1,daysForeward+1):
        string = 'closeShift'+str(days)
        stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
        num+=1

Here's the data before and after applying the percent change:

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869735     0.959720
3  19980107  20.1408  20.9950  1.034904    -3.947904    -5.022423
4  19980108  21.1115  20.0244  0.994047    -1.118683    -0.463311

Some explanations:

The [0] in stocks[0][i] is just to get to the proper level in the 3d data frame, the [i] is for the stock name in the stocks that are being iterated through in a higher up for loop.

The adjClose column is just a modified version of close which is what I prefer using instead of close.

adjPctChange() is a custom percent change function that switches the equation around so that 100 to 50 will product the same result as 50 to 100 so the results can be averaged and won't skew upwards.

def adjPctChange(startPoint, currentPoint):
    if startPoint < currentPoint:
        x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
    else:
        x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0    
    return x

Thanks to anyone who can help!

Upvotes: 2

Views: 248

Answers (2)

chrisaycock
chrisaycock

Reputation: 37928

You shouldn't loop over a DataFrame; just do everything with array functions.

Before:

In [30]: df
Out[30]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

Array notation:

daysForeward = 2
for day in range(1, daysForeward+1):
    column = 'closeShift' + str(day)
    df[column] = (df[column] - df.adjClose) / np.maximum(df[column], df.adjClose) * 100.0

After:

In [33]: df
Out[33]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869727     0.959713
3  19980107  20.1408  20.9950  1.034904    -3.947902    -5.022495
4  19980108  21.1115  20.0244  0.994047    -1.118760    -0.463358

Upvotes: 2

piRSquared
piRSquared

Reputation: 294506

IIUC:

I started off with this dataframe:

print df

       date     open    close  adjclose
0  19980102  20.3835  20.4417  0.984507
1  19980105  20.5097  20.5679  1.034904
2  19980106  20.1408  20.0826  0.994047
3  19980107  20.1408  20.9950  0.982926
4  19980108  21.1115  20.0244  0.989441

Then I created these functions:

def get_lags(s, n):
    return pd.concat([s.shift(i) for i in range(n + 1)],
                     axis=1, keys=range(n + 1))

def get_comps(lags):
    comps = []
    for i, cni in enumerate(lags.columns):
        if i > 0:
            max_ = lags.iloc[:, [0, i]].max(1)
            min_ = lags.iloc[:, [0, i]].min(1)
            comps.append((max_ / min_ - 1) * 100)
    return pd.concat(comps, axis=1)

Then I get lags and compare them:

print get_comps(get_lags(df.adjclose, 2))



          0         1
0  0.000000  0.000000
1  5.119009  0.000000
2  4.110168  0.969013
3  1.131418  5.288089
4  0.662817  0.465515

Finally, I concatenate them with df

print pd.concat([df, get_comps(get_lags(df.adjclose, 2))], axis=1)

       date     open    close  adjclose         0         1
0  19980102  20.3835  20.4417  0.984507  0.000000  0.000000
1  19980105  20.5097  20.5679  1.034904  5.119009  0.000000
2  19980106  20.1408  20.0826  0.994047  4.110168  0.969013
3  19980107  20.1408  20.9950  0.982926  1.131418  5.288089
4  19980108  21.1115  20.0244  0.989441  0.662817  0.465515

Modify as you need.

Upvotes: 0

Related Questions