s666

Reputation: 537

How to vectorise for loop on Pandas DataFrame

I have some code in which a for loop is run over a pandas DataFrame, and I would like to try to vectorise it, as it is currently a bottleneck in the program and can take a while to run.

I have two DataFrames, 'df' and 'symbol_data'.

df.head()

                    Open Time           Close Time2         Open Price
Close Time          
29/09/2016 00:16    29/09/2016 00:01    29/09/2016 00:16    1.1200
29/09/2016 00:17    29/09/2016 00:03    29/09/2016 00:17    1.1205
29/09/2016 00:18    29/09/2016 00:03    29/09/2016 00:18    1.0225
29/09/2016 00:19    29/09/2016 00:07    29/09/2016 00:19    1.0240
29/09/2016 00:20    29/09/2016 00:15    29/09/2016 00:20    1.0241

and

symbol_data.head()

                    OPEN    HIGH    LOW     LAST_PRICE
DATE                
29/09/2016 00:01    1.1216  1.1216  1.1215  1.1216
29/09/2016 00:02    1.1216  1.1216  1.1215  1.1215
29/09/2016 00:03    1.1215  1.1216  1.1215  1.1216
29/09/2016 00:04    1.1216  1.1216  1.1216  1.1216
29/09/2016 00:05    1.1216  1.1217  1.1216  1.1217
29/09/2016 00:06    1.1217  1.1217  1.1216  1.1217
29/09/2016 00:07    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:08    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:09    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:10    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:11    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:12    1.1217  1.1218  1.1217  1.1218
29/09/2016 00:13    1.1218  1.1218  1.1217  1.1217
29/09/2016 00:14    1.1217  1.1218  1.1217  1.1218
29/09/2016 00:15    1.1218  1.1218  1.1217  1.1217
29/09/2016 00:16    1.1217  1.1218  1.1217  1.1217
29/09/2016 00:17    1.1217  1.1218  1.1217  1.1217
29/09/2016 00:18    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:19    1.1217  1.1217  1.1217  1.1217
29/09/2016 00:20    1.1217  1.1218  1.1217  1.1218

The 'for loop' is as follows:

for row in range(len(df)):
    df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
    df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]

The code takes each row of 'df', which is an individual trade, and cross-references 'symbol_data' to find the maximum and minimum prices reached during the lifetime of that specific trade. It then subtracts the trade's opening price from those max and min values to calculate the furthest the trade went "onside" and "offside" while it was open.
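
To make it concrete, here is roughly what a single iteration computes, as a small self-contained sketch with made-up numbers (not my real data) and the indices already parsed as datetimes:

import pandas as pd

# Toy version of symbol_data: minute bars with a sorted DatetimeIndex.
symbol_data_toy = pd.DataFrame(
    {'HIGH': [1.1216, 1.1216, 1.1217, 1.1218],
     'LOW':  [1.1215, 1.1215, 1.1216, 1.1217]},
    index=pd.to_datetime(['2016-09-29 00:01', '2016-09-29 00:02',
                          '2016-09-29 00:05', '2016-09-29 00:16']))

# One trade: its open time, close time and open price.
open_time = pd.Timestamp('2016-09-29 00:01')
close_time = pd.Timestamp('2016-09-29 00:16')
open_price = 1.1200

window = symbol_data_toy.loc[open_time:close_time]   # all bars while the trade was open
max_pips = window['HIGH'].max() - open_price         # highest price seen minus the open price
min_pips = window['LOW'].min() - open_price          # lowest price seen minus the open price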

I can't figure out how to vectorise the code - I'm relatively new to coding and have generally used 'for loops' up until now.

Could anyone point me in the right direction or provide any hints as to how to achieve this vectorisation?

Thanks.

EDIT:

So I have tried the code kindly provided by Grr. I can replicate it and get it to work on the small test data I provided, but when I run it on my full data I keep getting the error message:

ValueError                                Traceback (most recent call last)
<ipython-input-113-19bc1c85f243> in <module>()
     93     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     94 
---> 95     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() -  df['Open Price']
     96     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() -  df['Open Price']
     97 

C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\tseries\index.py in wrapper(self, other)
    112             elif not isinstance(other, (np.ndarray, Index, ABCSeries)):
    113                 other = _ensure_datetime64(other)
--> 114             result = func(np.asarray(other))
    115             result = _values_from_object(result)
    116 

C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py in _evaluate_compare(self, other)
   3350                 if isinstance(other, (np.ndarray, Index, ABCSeries)):
   3351                     if other.ndim > 0 and len(self) != len(other):
-> 3352                         raise ValueError('Lengths must match to compare')
   3353 
   3354                 # we may need to directly compare underlying

ValueError: Lengths must match to compare

I have narrowed it down to the following piece of code:

shared_times >= df['Open Time']

When I try

shared_times >= df['Open Time'][0]

I get:

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True], dtype=bool)

So I know all the indices are correctly formatted as "DatetimeIndex".

type(shared_times[0])

pandas.tslib.Timestamp


type(df['Open Time'][0])

pandas.tslib.Timestamp


type(df['Close Time2'][0])

pandas.tslib.Timestamp

Could anyone suggest how I can get past this error message?

Upvotes: 0

Views: 3971

Answers (2)

Grr

Reputation: 16109

So it looks to me like there is a lot going on here beyond just trying to vectorize some code. Let's break down what you are doing.

For just the first step in each loop:

df['Max Pips'][row]  = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() -  df['Open Price'][row]

When you do symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]], pandas is behind the scenes building a pandas.DatetimeIndex (essentially a pandas.date_range) to resolve that slice. So for each row you are creating an array of tens of thousands of datetimes. Unfortunately, pandas cannot do this on an entire column, meaning you cannot do symbol_data.loc[df['Open Time']:df['Close Time2']]. In this case that is the step that is really blocking you from vectorizing your code.
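
To make that concrete, here is a quick sketch against the df and symbol_data from the question (assuming both hold real datetimes): scalar slice bounds work, whole columns of bounds do not.

# One row at a time is fine: scalar slice bounds give one window of quotes.
first_window = symbol_data.loc[df['Open Time'].iloc[0]:df['Close Time2'].iloc[0]]
print(first_window['HIGH'].max() - df['Open Price'].iloc[0])

# There is no column-wise equivalent: .loc cannot take whole Series as slice
# bounds and return one window per row, so this raises instead of broadcasting.
# symbol_data.loc[df['Open Time']:df['Close Time2']]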

First let's baseline your code. Given the example you provided, I wrapped your for loop in a function calc_time and timed its execution.

In [202]: def calc_time():
     df['Max Pips'] = 0.0
     df['Min Pips'] = 0.0
     for row in range(len(df)):
         df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
         df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
In [203]: %time calc_time()
/Users/grr/anaconda/bin/ipython:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sys.exit(IPython.start_ipython())
/Users/grr/anaconda/bin/ipython:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
CPU times: user 281 ms, sys: 3.46 ms, total: 284 ms
Wall time: 284 ms

So the total time was 284 ms. Not so good for 5 rows. Not to mention you get a cascade of warnings. We can do better.

As I mentioned above, your blocker is the way you are indexing on the date range. One way to get around this is by finding all of the indices in symbol_data that are also in df. That can be accomplished with the pandas.Index.isin method.

In [204]: shared_times = symbol_data[symbol_data.index.isin(df.index)].index
In [205]: shared_times
Out[205]:
Index(['29/09/2016 00:16', '29/09/2016 00:17', '29/09/2016 00:18',
   '29/09/2016 00:19', '29/09/2016 00:20'],
  dtype='object')

Now we can use your logic in a vectorized fashion like so (after dropping the Max Pips and Min Pips columns to ensure the purity of the experiment):

In [207]: def calc_time_vec():
     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [208]: %time calc_time_vec()
CPU times: user 2.98 ms, sys: 167 µs, total: 3.15 ms
Wall time: 3.04 ms

That took only 3.15 ms, a ~90x speed improvement! Or, if you want to be very conservative about the improvement, we can move the assignment of shared_times into the function itself.

In [210]: def calc_time_vec():
     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [211]: %time calc_time_vec()
CPU times: user 3.23 ms, sys: 171 µs, total: 3.4 ms
Wall time: 3.28 ms

Still, our improvement is around 84x, which is pretty good. That said, we can improve the function further: we compute the boolean array for the .loc argument twice. Let's fix that.

In [213]: def calc_time_vec():
     shared_times = symbol_data[symbol_data.index.isin(df.index)].index
     bool_arr = (shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])
     df['Max Pips'] = symbol_data.loc[bool_arr]['HIGH'].max() - df['Open Price']
     df['Min Pips'] = symbol_data.loc[bool_arr]['LOW'].min() - df['Open Price']
In [214]: %time calc_time_vec()
CPU times: user 2.83 ms, sys: 134 µs, total: 2.96 ms
Wall time: 2.87 ms      

Alright. Now we are down to 2.96 ms or an ~96x improvement over the original function.

I hope this sheds some light on how to go about vectorizing and improving more complex functions like this one. A lot of the time, even when code is mostly vectorized, there are still gains to be found by using built-in pandas or NumPy methods and by making sure you don't repeat yourself.

Upvotes: 1

Maarten Fabré

Reputation: 7058

I see a few problems with this code.

Duplication

Why do you need the 'Close Time2' column? It's just a copy of the index.

Iteration

Iterating over the rows of a DataFrame can be done a lot more easily.

If your column names contain no spaces, you can use itertuples like this:

for row in df.itertuples():
#     print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    df.loc[row.Index, 'Max Pips']  = prices['HIGH'].max() -  row.Open_Price
    df.loc[row.Index, 'Min Pips']  = prices['LOW'].min() -  row.Open_Price

This should minimize the going back and forth between the different DataFrames and increase performance, but it is not real vectorization.

You could try to vectorize part of this calculation like this:

price_max = pd.Series(index=df.index, dtype=float)
price_min = pd.Series(index=df.index, dtype=float)

for row in df.itertuples():
#     print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    price_max[row.Index]  = prices['HIGH'].max()
    price_min[row.Index]  = prices['LOW'].min()
df['Max Pips2'] = price_max - df['Open_Price']
df['Min Pips2'] = price_min - df['Open_Price']

But I don't think this will yield much of a difference.
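
If you do need more speed, one further option (a sketch only, not benchmarked against your data) is to drop down to NumPy: map each trade's open and close times to integer positions with np.searchsorted and take the max/min on plain array slices. It is still a Python loop, but without the per-row pandas indexing overhead. This assumes symbol_data has a sorted DatetimeIndex, the time columns in df are real datetimes, and every trade window contains at least one bar.

import numpy as np

times = symbol_data.index.values           # sorted datetime64 array
highs = symbol_data['HIGH'].values
lows = symbol_data['LOW'].values

# Positions of the first and last bar inside each trade's [open, close] window.
start = np.searchsorted(times, df['Open Time'].values, side='left')
stop = np.searchsorted(times, df['Close Time2'].values, side='right')

df['Max Pips'] = np.array([highs[i:j].max() for i, j in zip(start, stop)]) - df['Open Price'].values
df['Min Pips'] = np.array([lows[i:j].min() for i, j in zip(start, stop)]) - df['Open Price'].values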

Upvotes: 2
