vectorize for-loop to fill Pandas DataFrame

Question

For a financial application, I'm trying to create a DataFrame where each row is a session date value for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.

This is easy to do in a for-loop, but it takes quite a long time for large features sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?

import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """ 
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=range((-n_sessions + 1), 1)) 
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features

I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.

Short example of input and output:

>>> eq_data
             Open   High    Low  Close    Volume  Adj Close
Date

2014-01-02  15.42  15.45  15.28  15.44  31528500      14.96
2014-01-03  15.52  15.64  15.30  15.51  46122300      15.02
2014-01-06  15.72  15.76  15.52  15.58  42657600      15.09
2014-01-07  15.73  15.74  15.35  15.38  54476300      14.90
2014-01-08  15.60  15.71  15.51  15.54  48448300      15.05
2014-01-09  15.83  16.02  15.77  15.84  67836500      15.34
2014-01-10  16.01  16.11  15.94  16.07  44984000      15.57
2014-01-13  16.37  16.53  16.08  16.11  57566400      15.61
2014-01-14  16.31  16.43  16.17  16.40  44039200      15.89
2014-01-15  16.37  16.73  16.35  16.70  64118200      16.18
2014-01-16  16.67  16.76  16.56  16.73  38410800      16.21
2014-01-17  16.78  16.78  16.45  16.52  37152100      16.00
2014-01-21  16.64  16.68  16.36  16.41  35597200      15.90
2014-01-22  16.44  16.62  16.37  16.55  28741900      16.03
2014-01-23  16.49  16.53  16.31  16.43  37860800      15.92
2014-01-24  16.19  16.21  15.78  15.83  66023500      15.33
2014-01-27  15.90  15.91  15.52  15.71  51218700      15.22
2014-01-28  15.97  16.01  15.51  15.72  57677500      15.23
2014-01-29  15.48  15.53  15.20  15.26  52241500      14.90
2014-01-30  15.43  15.45  15.18  15.25  32654100      14.89
2014-01-31  15.09  15.10  14.90  14.96  64132600      14.61
>>> features = data.featurize(eq_data, 3)
>>> features
               -2     -1      0
Date

2014-01-06  14.96  15.02  15.09
2014-01-07  15.02  15.09   14.9
2014-01-08  15.09   14.9  15.05
2014-01-09   14.9  15.05  15.34
2014-01-10  15.05  15.34  15.57
2014-01-13  15.34  15.57  15.61
2014-01-14  15.57  15.61  15.89
2014-01-15  15.61  15.89  16.18
2014-01-16  15.89  16.18  16.21
2014-01-17  16.18  16.21     16
2014-01-21  16.21     16   15.9
2014-01-22     16   15.9  16.03
2014-01-23   15.9  16.03  15.92
2014-01-24  16.03  15.92  15.33
2014-01-27  15.92  15.33  15.22
2014-01-28  15.33  15.22  15.23
2014-01-29  15.22  15.23   14.9
2014-01-30  15.23   14.9  14.89
2014-01-31   14.9  14.89  14.61

So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the features DataFrame.

====================

Improved version based on Primer's answer below:

def featurize(equity_data, n_sessions, column='Adj Close'): """ Generate a raw (unnormalized) feature set from the input data. The value at column on the given date is taken as a feature, and each row contains values for n_sessions >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1), dt.date(2014, 12, 31)), 256)', setup=s, number=1) 1.6771750450134277 """ features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):], columns=map(str, range((-n_sessions + 1), 1)), dtype='float64') values = equity_data[column].values for i in range(n_sessions - 1): features.iloc[:, i] = values[i:(-n_sessions + i + 1)] features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):] return features

Primer · Accepted Answer

It looks like shift is your friend here and something like this will do:

df = pd.DataFrame({'adj close': np.random.random(10) + 15},index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df

                adj close
date                 
2014-01-02     15.650
2014-01-03     15.775
2014-01-06     15.750
2014-01-07     15.464
2014-01-08     15.966
2014-01-09     15.475
2014-01-10     15.164
2014-01-13     15.281
2014-01-14     15.568
2014-01-15     15.648

features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()
features['-2'] = df['adj close'].shift(2)
features.dropna(inplace=True)
features

                 0      -1      -2
date                              
2014-01-06  15.750  15.775  15.650
2014-01-07  15.464  15.750  15.775
2014-01-08  15.966  15.464  15.750
2014-01-09  15.475  15.966  15.464
2014-01-10  15.164  15.475  15.966
2014-01-13  15.281  15.164  15.475
2014-01-14  15.568  15.281  15.164
2014-01-15  15.648  15.568  15.281

vectorize for-loop to fill Pandas DataFrame

Answers (1)

Related Questions