s666

Reputation: 537

How to vectorise for loop on Pandas DataFrame

I have some code in which a for loop is run over a pandas DataFrame, and I would like to try to vectorise it, as it is currently a bottleneck in the program and can take a while to run.

I have two DataFrames, 'df' and 'symbol_data'.

df.head()

                    Open Time           Close Time2         Open Price
Close Time          
29/09/2016 00:16    29/09/2016 00:01    29/09/2016 00:16    1.1200
29/09/2016 00:17    29/09/2016 00:03    29/09/2016 00:17    1.1205
29/09/2016 00:18    29/09/2016 00:03    29/09/2016 00:18    1.0225
29/09/2016 00:19    29/09/2016 00:07    29/09/2016 00:19    1.0240
29/09/2016 00:20    29/09/2016 00:15    29/09/2016 00:20    1.0241

and

symbol_data.head()

                    OPEN    HIGH    LOW     LAST_PRICE
DATE                
29/09/2016 00:01    1.1216  1.1216  1.1215  1.1216
29/09/2016 00:02    1.1216  1.1216  1.1215  1.1215
29/09/2016 00:03    1.1215  1.1216  1.1215  1.1216
29/09/2016 00:04    1.1216  1.1216  1.1216  1.1216
29/09/2016 00:05    1.1216  1.1217  1.1216  1.1217
29/09/2016 00:06    1.1217  1.1217  1.1216  1.1217
29/09/2016 00:07    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:08    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:09    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:10    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:11    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:12    1.1217  1.1218  1.1217  1.1218
29/09/2016 00:13    1.1218  1.1218  1.1217  1.1217
29/09/2016 00:14    1.1217  1.1218  1.1217  1.1218
29/09/2016 00:15    1.1218  1.1218  1.1217  1.1217
29/09/2016 00:16    1.1217  1.1218  1.1217  1.1217
29/09/2016 00:17    1.1217  1.1218  1.1217  1.1217
29/09/2016 00:18    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:19    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:20    1.1217  1.1218  1.1217  1.1218

The 'for loop' is as follows:

for row in range(len(df)):
    df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
    df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]

The code takes each row of 'df', which is an individual trade, and cross-references 'symbol_data' to find the maximum and minimum prices reached during the lifetime of that specific trade. It then subtracts the trade's opening price from those max and min values to calculate the furthest the trade went "onside" and "offside" while it was open.
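
To make it concrete, here is roughly what a single iteration computes, as a small self-contained sketch with made-up numbers (not my real data) and the indices already parsed as datetimes:

import pandas as pd

# Toy version of symbol_data: minute bars with a sorted DatetimeIndex.
symbol_data_toy = pd.DataFrame(
    {'HIGH': [1.1216, 1.1216, 1.1217, 1.1218],
     'LOW':  [1.1215, 1.1215, 1.1216, 1.1217]},
    index=pd.to_datetime(['2016-09-29 00:01', '2016-09-29 00:02',
                          '2016-09-29 00:05', '2016-09-29 00:16']))

# One trade: its open time, close time and open price.
open_time = pd.Timestamp('2016-09-29 00:01')
close_time = pd.Timestamp('2016-09-29 00:16')
open_price = 1.1200

window = symbol_data_toy.loc[open_time:close_time]   # all bars while the trade was open
max_pips = window['HIGH'].max() - open_price         # highest price seen minus the open price
min_pips = window['LOW'].min() - open_price          # lowest price seen minus the open price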

I can't figure out how to vectorise the code - I'm relatively new to coding and have generally used 'for loops' up until now.

Could anyone point me in the right direction or provide any hints as to how to achieve this vectorisation?

Thanks.

EDIT:

So I have tried the code kindly provided by Grr. I can replicate it and get it to work on the small test data I provided, but when I run it on my full data I keep getting the error message:

ValueError                                Traceback (most recent call last)
<ipython-input-113-19bc1c85f243> in <module>()
     93     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     94 
---> 95     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() -  df['Open Price']
     96     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() -  df['Open Price']
     97 

C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\tseries\index.py in wrapper(self, other)
    112             elif not isinstance(other, (np.ndarray, Index, ABCSeries)):
    113                 other = _ensure_datetime64(other)
--> 114             result = func(np.asarray(other))
    115             result = _values_from_object(result)
    116 

C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py in _evaluate_compare(self, other)
   3350                 if isinstance(other, (np.ndarray, Index, ABCSeries)):
   3351                     if other.ndim > 0 and len(self) != len(other):
-> 3352                         raise ValueError('Lengths must match to compare')
   3353 
   3354                 # we may need to directly compare underlying

ValueError: Lengths must match to compare

I have narrowed it down to the following piece of code:

shared_times >= df['Open Time']

When I try

shared_times >= df['Open Time'][0]

I get:

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True], dtype=bool)

So I know all the indices are correctly formatted as "DatetimeIndex".

type(shared_times[0])

pandas.tslib.Timestamp


type(df['Open Time'][0])

pandas.tslib.Timestamp


type(df['Close Time2'][0])

pandas.tslib.Timestamp

Could anyone suggest how I can get past this error message?

Upvotes: 0

Views: 3971

Answers (2)

Grr

Reputation: 16109

So it looks to me like there is a lot going on here beyond just trying to vectorize some code. Let's break down what you are doing.

For just the first step in each loop:

df['Max Pips'][row]  = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() -  df['Open Price'][row]

When you do symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]], pandas is behind the scenes building a pandas.DatetimeIndex (essentially a pandas.date_range) to resolve that slice. So for each row you are creating an array of tens of thousands of datetimes. Unfortunately, pandas cannot do this on an entire column, meaning you cannot do symbol_data.loc[df['Open Time']:df['Close Time2']]. In this case that is the step that is really blocking you from vectorizing your code.
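
To make that concrete, here is a quick sketch against the df and symbol_data from the question (assuming both hold real datetimes): scalar slice bounds work, whole columns of bounds do not.

# One row at a time is fine: scalar slice bounds give one window of quotes.
first_window = symbol_data.loc[df['Open Time'].iloc[0]:df['Close Time2'].iloc[0]]
print(first_window['HIGH'].max() - df['Open Price'].iloc[0])

# There is no column-wise equivalent: .loc cannot take whole Series as slice
# bounds and return one window per row, so this raises instead of broadcasting.
# symbol_data.loc[df['Open Time']:df['Close Time2']]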

First let's baseline your code. Given the example you provided, I wrapped your for loop in a function calc_time and timed its execution.

In [202]: def calc_time():
     df['Max Pips'] = 0.0
     df['Min Pips'] = 0.0
     for row in range(len(df)):
         df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
         df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
In [203]: %time calc_time()
/Users/grr/anaconda/bin/ipython:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sys.exit(IPython.start_ipython())
/Users/grr/anaconda/bin/ipython:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
CPU times: user 281 ms, sys: 3.46 ms, total: 284 ms
Wall time: 284 ms

So the total time was 284 ms. Not so good for 5 rows. Not to mention you get a cascade of warnings. We can do better.

As I mentioned above, your blocker is the way you are indexing on the date range. One way to get around this is by finding all of the indices in symbol_data that are also in df. That can be accomplished with the pandas.Index.isin method.

In [204]: shared_times = symbol_data[symbol_data.index.isin(df.index)].index
In [205]: shared_times
Out[205]:
Index(['29/09/2016 00:16', '29/09/2016 00:17', '29/09/2016 00:18',
   '29/09/2016 00:19', '29/09/2016 00:20'],
  dtype='object')

Now we can use your logic in a vectorized fashion like so (after dropping the Max Pips and Min Pips columns to ensure the purity of the experiment):

In [207]: def calc_time_vec():
     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [208]: %time calc_time_vec()
CPU times: user 2.98 ms, sys: 167 µs, total: 3.15 ms
Wall time: 3.04 ms

That took only 3.15 ms, a ~90x speed improvement! Or, if you want to be very conservative about the improvement, we can move the assignment of shared_times into the function itself.

In [210]: def calc_time_vec():
     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [211]: %time calc_time_vec()
CPU times: user 3.23 ms, sys: 171 µs, total: 3.4 ms
Wall time: 3.28 ms

Still, our improvement is around 84x, which is pretty good. That said, we can improve the function further: we compute the boolean array for the .loc argument twice. Let's fix that.

In [213]: def calc_time_vec():
     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     bool_arr = (shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])
     df['Max Pips'] = symbol_data.loc[bool_arr]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[bool_arr]['LOW'].min() - df['Open Price']
In [214]: %time calc_time_vec()
CPU times: user 2.83 ms, sys: 134 µs, total: 2.96 ms
Wall time: 2.87 ms      

Alright. Now we are down to 2.96 ms or an ~96x improvement over the original function.

I hope this sheds some light on how to go about vectorizing and improving more complex functions like this one. A lot of the time, even when code is mostly vectorized, there are still gains to be found by using built-in pandas or NumPy methods and by making sure you don't repeat yourself.

Upvotes: 1

Maarten Fabré

Reputation: 7058

I see a few problems with this code.

Duplication

Why do you need the 'Close Time2' column? It's just a copy of the index.

Iteration

Iterating over the rows of a DataFrame can be done a lot more easily.

If your column names contain no spaces, you can use itertuples like this:

for row in df.itertuples():
#     print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    df.loc[row.Index, 'Max Pips']  = prices['HIGH'].max() -  row.Open_Price
    df.loc[row.Index, 'Min Pips']  = prices['LOW'].min() -  row.Open_Price

This should minimize the going back and forth between the different DataFrames and increase performance, but it is not real vectorization.

You could try to vectorize part of this calculation like this:

price_max = pd.Series(index=df.index, dtype=float)
price_min = pd.Series(index=df.index, dtype=float)

for row in df.itertuples():
#     print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    price_max[row.Index]  = prices['HIGH'].max()
    price_min[row.Index]  = prices['LOW'].min()
df['Max Pips2'] = price_max - df['Open_Price']
df['Min Pips2'] = price_min - df['Open_Price']

But I don't think this will yield much of a difference.
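
If you do need more speed, one further option (a sketch only, not benchmarked against your data) is to drop down to NumPy: map each trade's open and close times to integer positions with np.searchsorted and take the max/min on plain array slices. It is still a Python loop, but without the per-row pandas indexing overhead. This assumes symbol_data has a sorted DatetimeIndex, the time columns in df are real datetimes, and every trade window contains at least one bar.

import numpy as np

times = symbol_data.index.values           # sorted datetime64 array
highs = symbol_data['HIGH'].values
lows = symbol_data['LOW'].values

# Positions of the first and last bar inside each trade's [open, close] window.
start = np.searchsorted(times, df['Open Time'].values, side='left')
stop = np.searchsorted(times, df['Close Time2'].values, side='right')

df['Max Pips'] = np.array([highs[i:j].max() for i, j in zip(start, stop)]) - df['Open Price'].values
df['Min Pips'] = np.array([lows[i:j].min() for i, j in zip(start, stop)]) - df['Open Price'].values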

Upvotes: 2
