Reputation: 1212
The table below shows an example of a strategy where a signal is generated in row 2 and then an opposite signal is generated in row 5.
row open_signal close_signal live
1 0 0 0
2 1 0 1
3 0 0 1
4 0 0 1
5 0 1 0
6 0 0 0
I want to optimise the calculation of the live column.
Is there a way to vectorise this problem in either Pandas or NumPy for improved performance, generating the same result as the for-loop example below?
import pandas as pd
from datetime import datetime

example = {'date': [str(datetime(2017,1,1)), str(datetime(2017,1,2)), str(datetime(2017,1,3)), str(datetime(2017,1,4)), str(datetime(2017,1,5)), str(datetime(2017,1,6)),
                    str(datetime(2017,1,7)), str(datetime(2017,1,8)), str(datetime(2017,1,9)), str(datetime(2017,1,10)), str(datetime(2017,1,11)), str(datetime(2017,1,12)),
                    str(datetime(2017,1,13)), str(datetime(2017,1,14))],
           'open': [142.11, 142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75],
           'close': [142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75, 142.11],
           'open_signal': [False, False, False, False, False, True, False, False, False, False, False, False, False, False],
           'close_signal': [False, False, False, False, False, False, False, False, False, True, False, False, False, False]
           }
data = pd.DataFrame(example)

# carry the trade state row by row: open_signal switches it on, close_signal off
in_trade = False
for i in data.iterrows():
    if i[1].open_signal:
        in_trade = True
    if i[1].close_signal:
        in_trade = False
    data.loc[i[0], 'in_trade'] = in_trade
Upvotes: 3
Views: 110
Reputation: 221664
For the simplistic case as in the posted sample, here's one vectorized way with NumPy -
import numpy as np

ar = np.zeros(len(data), dtype=int)
ar[data.open_signal.values] = 1      # +1 where a trade opens
ar[data.close_signal.values] = -1    # -1 where a trade closes
data['out'] = ar.cumsum().astype(bool)   # running sum is non-zero while a trade is live
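The +1/-1 markers and the running sum reproduce the trade state tracked by the loop. As a quick sanity check (a sketch, assuming the loop from the question has already filled data['in_trade']):
assert np.array_equal(data['out'].values,
                      data['in_trade'].astype(bool).values)
# holds for the question's frame: both mark index 5 through 8 as live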
Runtime test -
Using the sample dataset and scaling it 100000 times along the rows for the testing.
In [191]: data = pd.concat([data]*100000,axis=0); data.index = range(len(data))
# @Dark's soln with int output
In [192]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(int)
100 loops, best of 3: 13.4 ms per loop
# @Dark's soln with bool output
In [194]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(bool)
100 loops, best of 3: 10 ms per loop
# Proposed in this post
In [195]: %%timeit
...: ar = np.zeros(len(data), dtype=int)
...: ar[data.open_signal.values] = 1
...: ar[data.close_signal.values] = -1
...: data['out'] = ar.cumsum().astype(bool)
100 loops, best of 3: 7.52 ms per loop
Now, to solve for a generic case with:
1] A close signal coming with no prior open signal.
2] Multiple open signals coming prior to the next close signal.
3] Multiple close signals coming prior to the next open signal.
we would need a few more steps.
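For reference, a frame like the following (a sketch, not from the original post) exhibits all three cases and is the one behind the sample output shown further below:
import numpy as np
import pandas as pd

open_sig = np.zeros(14, dtype=bool)
close_sig = np.zeros(14, dtype=bool)
open_sig[[5, 7]] = True        # two open signals before the next close
close_sig[[2, 9, 11]] = True   # a close with no prior open, then two closes in a row
data = pd.DataFrame({'open_signal': open_sig, 'close_signal': close_sig})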
Approach #1 : Here's one based upon searchsorted -
s0 = np.flatnonzero(data.open_signal.values)   # row indices of open signals
s1 = np.flatnonzero(data.close_signal.values)  # row indices of close signals

# within each run of opens between two closes keep only the first one,
# then do the same for closes between the kept opens
idx0 = np.searchsorted(s1, s0, 'right')
s0c = s0[np.r_[True, idx0[1:] > idx0[:-1]]]
idx1 = np.searchsorted(s0c, s1, 'right')
s1c = s1[np.r_[True, idx1[1:] > idx1[:-1]]]

ar = np.zeros(len(data), dtype=int)
ar[s0c] = 1
ar[s1c] = -1
if s1c[0] < s0c[0]:        # a leading close with no prior open contributes nothing
    ar[s1c[0]] = 0
data['out'] = ar.cumsum().astype(bool)
Sample output -
In [360]: data
Out[360]:
close_signal open_signal out
0 False False False
1 False False False
2 True False False
3 False False False
4 False False False
5 False True True
6 False False True
7 False True True
8 False False True
9 True False False
10 False False False
11 True False False
12 False False False
13 False False False
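To see what the deduplication does on this sample, here are the intermediate values worked out by hand (a sketch for the frame above):
s0 = np.flatnonzero(data.open_signal.values)    # array([5, 7])
s1 = np.flatnonzero(data.close_signal.values)   # array([ 2,  9, 11])
# idx0 = [1, 1]    -> the open at row 7 falls in the same close-interval as row 5, so s0c = [5]
# idx1 = [0, 1, 1] -> the close at row 11 repeats the interval of row 9, so s1c = [2, 9]
# s1c[0] = 2 < s0c[0] = 5, so the leading close at row 2 is zeroed out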
Approach #2 : Possibly a faster one, as we would avoid using searchsorted and instead leverage masking -
# +1 at opens, -1 at closes, 0 elsewhere
mix_arr = data.open_signal.values.astype(int) - data.close_signal.values
ar = np.zeros(len(data), dtype=int)
mix_mask = mix_arr != 0
mix_val = mix_arr[mix_mask]

# within each run of identical signals, keep only the first one
valid_mask = np.r_[True, mix_val[1:] != mix_val[:-1]]
ar[mix_mask] = mix_arr[mix_mask]*valid_mask
if mix_val[0] == -1:          # a leading close with no prior open contributes nothing
    ar[mix_mask.argmax()] = 0
data['out'] = ar.cumsum().astype(bool)
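On the same sample frame, the intermediates (worked out by hand, as a sketch) look like this:
mix_arr = data.open_signal.values.astype(int) - data.close_signal.values
# mix_arr    -> [0, 0, -1, 0, 0, 1, 0, 1, 0, -1, 0, -1, 0, 0]
# mix_val    -> [-1, 1, 1, -1, -1]                 # the signals in order of appearance
# valid_mask -> [True, True, False, True, False]   # only the first of each run survives
# ar ends up with -1 at row 2, +1 at row 5 and -1 at row 9; since mix_val[0] == -1,
# the leading close at row 2 is zeroed out before the cumsum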
Upvotes: 4
Reputation: 30605
We can compare the cumulative sums of the two signal columns, i.e.
data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(int)
row open_signal close_signal live new
0 1 0 0 0 0
1 2 1 0 1 1
2 3 0 0 1 1
3 4 0 0 1 1
4 5 0 1 0 0
5 6 0 0 0 0
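The two running counts differ exactly while a trade is live; worked out on the 6-row table from the question (a sketch, values shown for that table):
data['open_signal'].cumsum()    # 0, 1, 1, 1, 1, 1   <- opens seen so far
data['close_signal'].cumsum()   # 0, 0, 0, 0, 1, 1   <- closes seen so far
# the counts differ exactly in rows 2-4, i.e. while the trade is live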
Upvotes: 2