user1234440

Reputation: 23567

Python Setting Values Without Loop

I have a time series DataFrame containing only 1s and 0s (true/false). I wrote a function that loops over all rows whose value is 1. Given a user-defined integer parameter n_hold, I set the next n_hold rows after each such row to 1 as well.

For example, in the DataFrame below the loop reaches row 2016-08-05. If n_hold = 2, then 2016-08-08 and 2016-08-09 are set to 1 as well:

2016-08-03    0
2016-08-04    0
2016-08-05    1
2016-08-08    0
2016-08-09    0
2016-08-10    0

The resulting df is then:

2016-08-03    0
2016-08-04    0
2016-08-05    1
2016-08-08    1
2016-08-09    1
2016-08-10    0

The problem is that this runs tens of thousands of times, and my current solution, which loops over the rows containing ones and slices the DataFrame, is far too slow. Is there a much faster way to do this?

Here is my (slow) solution, x is the initial signal dataframe:

n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
    row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])

    if (row_idx + n_hold) >= len(x):
        break

    final_signal[row_idx:(row_idx + n_hold + 1)] = 1

Upvotes: 3

Views: 229

Answers (1)

jezrael

Reputation: 862511

Completely changed answer, because consecutive 1 values need to be handled differently:

Explanation:

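The solution keeps only the first 1 of each run of consecutive 1s and turns every other value into NaN. It does this with where and a chained boolean mask: x is compared with its shifted self using ne (not equal, !=), and the result is combined with x == 1. The NaNs are then forward-filled by ffill, with the limit parameter capping the fill at n_hold rows, and the remaining NaNs are finally replaced with 0: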

n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
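To make the chained expression easier to follow, here is a minimal sketch that splits it into named steps and runs it on the six-row example from the question. The intermediate names (is_entry, entries_only, held) are made up for illustration, and the last step uses .fillna(0).astype(int) instead of downcast='int', on the assumption that you are on a recent pandas release where the downcast argument is deprecated.

```python
import pandas as pd

# Example data from the question; the variable name x follows the post.
idx = pd.to_datetime(["2016-08-03", "2016-08-04", "2016-08-05",
                      "2016-08-08", "2016-08-09", "2016-08-10"])
x = pd.Series([0, 0, 1, 0, 0, 0], index=idx)
n_hold = 2

# Step 1: True only on the first 1 of each run of consecutive 1s.
is_entry = x.ne(x.shift()) & (x == 1)

# Step 2: keep the entry values, turn everything else into NaN.
entries_only = x.where(is_entry)

# Step 3: carry each entry forward by at most n_hold rows.
held = entries_only.ffill(limit=n_hold)

# Step 4: rows still NaN lie outside every hold window, so set them to 0.
# astype(int) stands in for downcast='int' (an assumption about your pandas version).
result = held.fillna(0).astype(int)

print(result.tolist())  # [0, 0, 1, 1, 1, 0]
```

Because a 1 that directly follows another 1 is not treated as a new entry, a long run of 1s does not keep extending the hold window.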

Setup and comparing outputs:

import numpy as np
import pandas as pd

np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)


def orig(x):
    n_hold = 2
    entry_sig_diff = x.diff()
    entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
    final_signal = x * 0
    for i in range(0, len(entry_sig_dt)):
        row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])

        if (row_idx + n_hold) >= len(x):
            break

        final_signal[row_idx:(row_idx + n_hold + 1)] = 1
    return final_signal

#print (orig(x))

n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)

df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
    input  orig  new
0       0     0    0
1       0     0    0
2       0     0    0
3       0     0    0
4       0     0    0
5       0     0    0
6       1     1    1
7       0     1    1
8       0     1    1
9       0     0    0
10      0     0    0
11      0     0    0
12      0     0    0
13      0     0    0
14      0     0    0
15      0     0    0
16      0     0    0
17      0     0    0
18      0     0    0
19      0     0    0

#check outputs
#print (s.values == orig(x).values)
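
The commented-out check above can also be run on a small hand-made series where the expected result is easy to verify by eye. The sketch below is an assumption-laden illustration, not the code that was timed: hold_loop and hold_vectorized are hypothetical names, the loop is modeled on the question's code but uses positional .iloc slicing, and the vectorized version uses .fillna(0).astype(int) in place of downcast='int'.

```python
import pandas as pd

def hold_loop(x, n_hold):
    """Reference loop modeled on the question's code (hypothetical helper)."""
    out = x * 0
    # Positions where the signal switches from 0 to 1.
    starts = (x.diff() == 1).to_numpy().nonzero()[0]
    for pos in starts:
        # Like the question's code, stop if the window would pass the end.
        if pos + n_hold >= len(x):
            break
        out.iloc[pos:pos + n_hold + 1] = 1
    return out

def hold_vectorized(x, n_hold):
    """Vectorized version built from where / ffill(limit=...) (hypothetical helper)."""
    is_entry = x.ne(x.shift()) & (x == 1)
    return x.where(is_entry).ffill(limit=n_hold).fillna(0).astype(int)

x_small = pd.Series([0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
loop_out = hold_loop(x_small, 2).tolist()
vec_out = hold_vectorized(x_small, 2).tolist()

print(vec_out)              # [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
print(loop_out == vec_out)  # True
```

The two approaches can disagree at the edges: the loop stops as soon as a hold window would run past the end of the series, and diff() yields NaN on the first row, so a 1 in the very first row is not picked up as an entry by the loop, whereas the where/ffill version handles both cases.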

Timings:

%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 2
