Reputation: 23567
I have a time series dataframe where there is 1 or 0 in it (true/false). I wrote a function that loops through all rows with values 1 in them. Given user defined integer parameter called n_hold
, I will set values 1 to n rows forward from the initial row.
For example, in the dataframe below I will be loop to row 2016-08-05
. If n_hold = 2
, then I will set both 2016-08-08
and 2016-08-09
to 1 too.:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 0
2016-08-09 0
2016-08-10 0
The resulting df
will then is
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 1
2016-08-09 1
2016-08-10 0
The problem I have is this is being run 10s of thousands of times and my current solution where I am looping over rows where there are ones and subsetting is way too slow. I was wondering if there are any solutions to the above problem that is really fast.
Here is my (slow) solution, x
is the initial signal dataframe:
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
Upvotes: 3
Views: 229
Reputation: 862511
Completely changed answer, because working differently with consecutive 1
values:
Explanation:
Solution remove each consecutive 1
first by where
with chained boolean mask by comparing with ne
(not equal !=
) with shift
to NaN
s, forward filling them by ffill
with limit
parameter and last replace 0
back:
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
Timings and comparing outputs:
np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)
def orig(x):
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
return final_signal
#print (orig(x))
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)
df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
input orig new
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 1 1
7 0 1 1
8 0 1 1
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 0 0 0
#check outputs
#print (s.values == orig(x).values)
Timings:
%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2