Marshall Farrier
Marshall Farrier

Reputation: 967

vectorize for-loop to fill Pandas DataFrame

For a financial application, I'm trying to create a DataFrame where each row is a session date value for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.

This is easy to do in a for-loop, but it takes quite a long time for large features sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?

import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """ 
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=range((-n_sessions + 1), 1)) 
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features

I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.

Short example of input and output:

>>> eq_data Open High Low Close Volume Adj Close Date
2014-01-02 15.42 15.45 15.28 15.44 31528500 14.96 2014-01-03 15.52 15.64 15.30 15.51 46122300 15.02 2014-01-06 15.72 15.76 15.52 15.58 42657600 15.09 2014-01-07 15.73 15.74 15.35 15.38 54476300 14.90 2014-01-08 15.60 15.71 15.51 15.54 48448300 15.05 2014-01-09 15.83 16.02 15.77 15.84 67836500 15.34 2014-01-10 16.01 16.11 15.94 16.07 44984000 15.57 2014-01-13 16.37 16.53 16.08 16.11 57566400 15.61 2014-01-14 16.31 16.43 16.17 16.40 44039200 15.89 2014-01-15 16.37 16.73 16.35 16.70 64118200 16.18 2014-01-16 16.67 16.76 16.56 16.73 38410800 16.21 2014-01-17 16.78 16.78 16.45 16.52 37152100 16.00 2014-01-21 16.64 16.68 16.36 16.41 35597200 15.90 2014-01-22 16.44 16.62 16.37 16.55 28741900 16.03 2014-01-23 16.49 16.53 16.31 16.43 37860800 15.92 2014-01-24 16.19 16.21 15.78 15.83 66023500 15.33 2014-01-27 15.90 15.91 15.52 15.71 51218700 15.22 2014-01-28 15.97 16.01 15.51 15.72 57677500 15.23 2014-01-29 15.48 15.53 15.20 15.26 52241500 14.90 2014-01-30 15.43 15.45 15.18 15.25 32654100 14.89 2014-01-31 15.09 15.10 14.90 14.96 64132600 14.61 >>> features = data.featurize(eq_data, 3) >>> features -2 -1 0 Date
2014-01-06 14.96 15.02 15.09 2014-01-07 15.02 15.09 14.9 2014-01-08 15.09 14.9 15.05 2014-01-09 14.9 15.05 15.34 2014-01-10 15.05 15.34 15.57 2014-01-13 15.34 15.57 15.61 2014-01-14 15.57 15.61 15.89 2014-01-15 15.61 15.89 16.18 2014-01-16 15.89 16.18 16.21 2014-01-17 16.18 16.21 16 2014-01-21 16.21 16 15.9 2014-01-22 16 15.9 16.03 2014-01-23 15.9 16.03 15.92 2014-01-24 16.03 15.92 15.33 2014-01-27 15.92 15.33 15.22 2014-01-28 15.33 15.22 15.23 2014-01-29 15.22 15.23 14.9 2014-01-30 15.23 14.9 14.89 2014-01-31 14.9 14.89 14.61

So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the features DataFrame.

====================

Improved version based on Primer's answer below:

def featurize(equity_data, n_sessions, column='Adj Close'): """ Generate a raw (unnormalized) feature set from the input data. The value at column on the given date is taken as a feature, and each row contains values for n_sessions >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1), dt.date(2014, 12, 31)), 256)', setup=s, number=1) 1.6771750450134277 """ features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):], columns=map(str, range((-n_sessions + 1), 1)), dtype='float64') values = equity_data[column].values for i in range(n_sessions - 1): features.iloc[:, i] = values[i:(-n_sessions + i + 1)] features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):] return features

Upvotes: 2

Views: 1277

Answers (1)

Primer
Primer

Reputation: 10302

It looks like shift is your friend here and something like this will do:

df = pd.DataFrame({'adj close': np.random.random(10) + 15},index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df

                adj close
date                 
2014-01-02     15.650
2014-01-03     15.775
2014-01-06     15.750
2014-01-07     15.464
2014-01-08     15.966
2014-01-09     15.475
2014-01-10     15.164
2014-01-13     15.281
2014-01-14     15.568
2014-01-15     15.648

features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()
features['-2'] = df['adj close'].shift(2)
features.dropna(inplace=True)
features

                 0      -1      -2
date                              
2014-01-06  15.750  15.775  15.650
2014-01-07  15.464  15.750  15.775
2014-01-08  15.966  15.464  15.750
2014-01-09  15.475  15.966  15.464
2014-01-10  15.164  15.475  15.966
2014-01-13  15.281  15.164  15.475
2014-01-14  15.568  15.281  15.164
2014-01-15  15.648  15.568  15.281

Upvotes: 2

Related Questions