Reputation: 537
I have some code in which a for loop runs over a pandas DataFrame. It is currently a bottleneck in the program and can take a while to run, so I would like to vectorise it.
I have two DataFrames, 'df' and 'symbol_data'.
df.head()
                  Open Time         Close Time2       Open Price
Close Time
29/09/2016 00:16  29/09/2016 00:01  29/09/2016 00:16      1.1200
29/09/2016 00:17  29/09/2016 00:03  29/09/2016 00:17      1.1205
29/09/2016 00:18  29/09/2016 00:03  29/09/2016 00:18      1.0225
29/09/2016 00:19  29/09/2016 00:07  29/09/2016 00:19      1.0240
29/09/2016 00:20  29/09/2016 00:15  29/09/2016 00:20      1.0241
and
symbol_data.head()
OPEN HIGH LOW LAST_PRICE
DATE
29/09/2016 00:01 1.1216 1.1216 1.1215 1.1216
29/09/2016 00:02 1.1216 1.1216 1.1215 1.1215
29/09/2016 00:03 1.1215 1.1216 1.1215 1.1216
29/09/2016 00:04 1.1216 1.1216 1.1216 1.1216
29/09/2016 00:05 1.1216 1.1217 1.1216 1.1217
29/09/2016 00:06 1.1217 1.1217 1.1216 1.1217
29/09/2016 00:07 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:08 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:09 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:10 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:11 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:12 1.1217 1.1218 1.1217 1.1218
29/09/2016 00:13 1.1218 1.1218 1.1217 1.1217
29/09/2016 00:14 1.1217 1.1218 1.1217 1.1218
29/09/2016 00:15 1.1218 1.1218 1.1217 1.1217
29/09/2016 00:16 1.1217 1.1218 1.1217 1.1217
29/09/2016 00:17 1.1217 1.1218 1.1217 1.1217
29/09/2016 00:18 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:19 1.1217 1.1217 1.1217 1.1217
29/09/2016 00:20 1.1217 1.1218 1.1217 1.1218
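For anyone wanting to reproduce this, a minimal sketch of how the two frames can be rebuilt (truncated to a couple of rows; note dayfirst=True, since the timestamps are day-first):

```python
import pandas as pd

# Hypothetical reconstruction of the question's frames (first rows only).
# The timestamps are day-first (29/09/2016), hence dayfirst=True.
symbol_data = pd.DataFrame(
    {'OPEN': [1.1216, 1.1216], 'HIGH': [1.1216, 1.1216],
     'LOW': [1.1215, 1.1215], 'LAST_PRICE': [1.1216, 1.1215]},
    index=pd.to_datetime(['29/09/2016 00:01', '29/09/2016 00:02'], dayfirst=True))
symbol_data.index.name = 'DATE'

df = pd.DataFrame(
    {'Open Time': pd.to_datetime(['29/09/2016 00:01'], dayfirst=True),
     'Close Time2': pd.to_datetime(['29/09/2016 00:16'], dayfirst=True),
     'Open Price': [1.1200]},
    index=pd.to_datetime(['29/09/2016 00:16'], dayfirst=True))
df.index.name = 'Close Time'
```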
The 'for loop' is as follows:
for row in range(len(df)):
    df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
    df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
The code takes each row of 'df', which is an individual trade, and cross-references the data in 'symbol_data' to find the minimum and maximum prices reached over the lifetime of that specific trade. It then subtracts the trade's opening price from that max or min value to calculate the maximum distance the trade went "onside" and "offside" while it was open.
I can't figure out how to vectorise the code - I'm relatively new to coding and have generally used 'for loops' up until now.
Could anyone point me in the right direction or provide any hints as to how to achieve this vectorisation?
Thanks.
EDIT:
So I have tried the code kindly provided by Grr, and I can replicate it and get it to work on the small test data I provided. But when I run it on my full data, I keep getting this error message:
ValueError Traceback (most recent call last)
<ipython-input-113-19bc1c85f243> in <module>()
93 shared_times = symbol_data[symbol_data.index.isin(df.index)].index
94
---> 95 df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
96 df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
97
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\tseries\index.py in wrapper(self, other)
112 elif not isinstance(other, (np.ndarray, Index, ABCSeries)):
113 other = _ensure_datetime64(other)
--> 114 result = func(np.asarray(other))
115 result = _values_from_object(result)
116
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py in _evaluate_compare(self, other)
3350 if isinstance(other, (np.ndarray, Index, ABCSeries)):
3351 if other.ndim > 0 and len(self) != len(other):
-> 3352 raise ValueError('Lengths must match to compare')
3353
3354 # we may need to directly compare underlying
ValueError: Lengths must match to compare
I have narrowed it down to the following piece of code:
shared_times >= df['Open Time']
When I try
shared_times >= df['Open Time'][0]
I get:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True], dtype=bool)
So I know all the indices are correctly formatted as "DatetimeIndex".
type(shared_times[0])
pandas.tslib.Timestamp
type(df['Open Time'][0])
pandas.tslib.Timestamp
type(df['Close Time2'][0])
pandas.tslib.Timestamp
Could anyone suggest how I can get past this error message?
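For what it's worth, the error can be reproduced in isolation (a minimal sketch with made-up timestamps, not the original data): comparing an index against a single scalar broadcasts fine, but comparing it against a Series of a different length cannot.

```python
import pandas as pd

idx = pd.date_range('2016-09-29 00:01', periods=20, freq='min')  # plays the role of shared_times (20 entries)
opens = pd.Series(idx[:5])                                       # plays the role of df['Open Time'] (5 trades)

ok = idx >= opens[0]        # scalar comparison: broadcasts to 20 booleans
assert ok.all()

try:
    idx >= opens            # 20 elements vs 5: no elementwise meaning
    mismatch_raised = False
except ValueError:          # "Lengths must match to compare"
    mismatch_raised = True
assert mismatch_raised
```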
Upvotes: 0
Views: 3971
Reputation: 16109
So it looks to me like there is a lot going on here beyond just trying to vectorize some code. Let's break down what you are doing.
For just the first step in each loop:
df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
When you do symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]], pandas is behind the scenes resolving the slice against a pandas.DatetimeIndex, much as if you had built one with pandas.date_range. So essentially, for each row you are materialising a window of potentially tens of thousands of datetimes. Unfortunately, pandas cannot do this on an entire column: you cannot write symbol_data.loc[df['Open Time']:df['Close Time2']]. In this case, that is the step that really blocks you from vectorizing your code.
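To make that concrete (a small made-up frame, not the original data): slicing with two scalar timestamps works, but whole columns cannot serve as slice bounds.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2016-09-29 00:01', periods=20, freq='min')
symbol_data = pd.DataFrame({'HIGH': np.arange(20.0)}, index=idx)

# Scalar bounds: .loc slices a DatetimeIndex inclusively on both ends.
window = symbol_data.loc[idx[2]:idx[6]]
assert len(window) == 5
assert window['HIGH'].max() == 6.0

# Column (Series) bounds are rejected: slice endpoints must be scalars.
try:
    symbol_data.loc[pd.Series(idx[:3]):pd.Series(idx[5:8])]
except Exception:
    pass
```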
First, let's baseline your code. Given the example you provided, I wrapped your for loop into a function calc_time and timed its execution.
In [202]: def calc_time():
              df['Max Pips'] = 0.0
              df['Min Pips'] = 0.0
              for row in range(len(df)):
                  df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
                  df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
In [203]: %time calc_time()
/Users/grr/anaconda/bin/ipython:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
sys.exit(IPython.start_ipython())
/Users/grr/anaconda/bin/ipython:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
CPU times: user 281 ms, sys: 3.46 ms, total: 284 ms
Wall time: 284 ms
So the total time was 284 ms, which is not great for a five-row frame, and on top of that you get a cascade of warnings. We can do better.
As I mentioned above, your blocker is the way you are indexing on the date range. One way around this is to find all of the indices in symbol_data that are also in df, which can be accomplished with the pandas.Index.isin method.
In [204]: shared_times = symbol_data[symbol_data.index.isin(df.index)].index
In [205]: shared_times
Out[205]:
Index(['29/09/2016 00:16', '29/09/2016 00:17', '29/09/2016 00:18',
'29/09/2016 00:19', '29/09/2016 00:20'],
dtype='object')
Now we can apply your logic in a vectorized fashion like so (after dropping the Max Pips and Min Pips columns to keep the experiment clean):
In [207]: def calc_time_vec():
              df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [208]: %time calc_time_vec()
CPU times: user 2.98 ms, sys: 167 µs, total: 3.15 ms
Wall time: 3.04 ms
That took only 3.15 ms, roughly a 90x speed-up! Or, to be very conservative about the improvement, we can move the assignment of shared_times into the function itself.
In [210]: def calc_time_vec():
              shared_times = symbol_data[symbol_data.index.isin(df.index)].index
              df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [211]: %time calc_time_vec()
CPU times: user 3.23 ms, sys: 171 µs, total: 3.4 ms
Wall time: 3.28 ms
Still, our improvement is around 84x, which is pretty good. That said, we can improve the function further: we compute the boolean array passed to .loc twice. Let's fix that.
In [213]: def calc_time_vec():
              shared_times = symbol_data[symbol_data.index.isin(df.index)].index
              bool_arr = (shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])
              df['Max Pips'] = symbol_data.loc[bool_arr]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[bool_arr]['LOW'].min() - df['Open Price']
In [214]: %time calc_time_vec()
CPU times: user 2.83 ms, sys: 134 µs, total: 2.96 ms
Wall time: 2.87 ms
Alright. Now we are down to 2.96 ms or an ~96x improvement over the original function.
I hope this sheds some light on how to go about vectorizing and improving more complex functions like this one. A lot of the time, even if code is mostly vectorized, there are still gains to be had by using built-in pandas or NumPy methods and by making sure you don't repeat yourself.
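In that spirit, one more idea worth sketching (the frames below are hypothetical stand-ins, and this has not been tested against the full dataset): since symbol_data is sorted by time, numpy.searchsorted can locate every trade's window in one vectorized call, leaving only a short comprehension over trades for the max/min reductions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same shape as the question's frames.
idx = pd.date_range('2016-09-29 00:01', periods=20, freq='min')
symbol_data = pd.DataFrame({'HIGH': np.linspace(1.1215, 1.1218, 20),
                            'LOW':  np.linspace(1.1214, 1.1217, 20)}, index=idx)
df = pd.DataFrame({'Open Time':   idx[[0, 2, 2, 6, 14]],
                   'Close Time2': idx[15:20],
                   'Open Price':  [1.1200, 1.1205, 1.0225, 1.0240, 1.0241]},
                  index=idx[15:20])

# Window bounds for every trade at once.
lo = symbol_data.index.searchsorted(df['Open Time'].to_numpy())                   # first bar >= open
hi = symbol_data.index.searchsorted(df['Close Time2'].to_numpy(), side='right')   # one past last bar <= close
high = symbol_data['HIGH'].to_numpy()
low = symbol_data['LOW'].to_numpy()

# One NumPy slice per trade instead of one .loc lookup per trade.
df['Max Pips'] = np.array([high[a:b].max() for a, b in zip(lo, hi)]) - df['Open Price'].to_numpy()
df['Min Pips'] = np.array([low[a:b].min() for a, b in zip(lo, hi)]) - df['Open Price'].to_numpy()
```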
Upvotes: 1
Reputation: 7058
I see a few problems with this code.
Why do you need the 'Close Time2' column? It is just a copy of the index.
Iterating over rows of a DataFrame can be a lot easier.
If you use column names without spaces, you can use the following method:
for row in df.itertuples():
    # print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    df.loc[row.Index, 'Max Pips'] = prices['HIGH'].max() - row.Open_Price
    df.loc[row.Index, 'Min Pips'] = prices['LOW'].min() - row.Open_Price
This should minimize the back-and-forth between the two DataFrames and improve performance, but it is not real vectorization.
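Note the loop above assumes the columns were renamed to drop spaces, so that itertuples exposes them as attributes; a minimal way to do that:

```python
import pandas as pd

# Hypothetical frame with the question's original column names.
df = pd.DataFrame({'Open Time': [1], 'Close Time2': [2], 'Open Price': [1.12]})
# Replace spaces so itertuples yields row.Open_Time, row.Open_Price, etc.
df = df.rename(columns=lambda c: c.replace(' ', '_'))
```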
You could try to vectorize part of this calculation like this
price_max = pd.Series(index=df.index, dtype=float)
price_min = pd.Series(index=df.index, dtype=float)
for row in df.itertuples():
    # print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    price_max[row.Index] = prices['HIGH'].max()
    price_min[row.Index] = prices['LOW'].min()
df['Max Pips2'] = price_max - df['Open_Price']
df['Min Pips2'] = price_min - df['Open_Price']
But I don't think this will yield much of a difference in speed.
Upvotes: 2