Reputation: 6091
I have a dataframe with a time series of scores. My goal is to detect when the score is larger than a certain threshold th
and then to find when the score goes back to 0. Is quite easy to find each condition separately
dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())
However, I don't know what's the most pythonic way to override dates_2 so that only dates when an 'active' date_1 has been observed
Perhaps using an auxiliary column 'active' set to 1 whenever score > th
is True and set it to False when the condition for dates_2 is met. That way I can ask for the change in sign AND active == True
. However, that approach requires iteration and I'm wondering if there's a vectorized solution to my problem
Any thoughts on how to improve my approach?
Sample data:
date score
2010-01-04 0.0
2010-01-05 -0.3667779798467592
2010-01-06 -1.9641427199568868
2010-01-07 -0.49976215445519134
2010-01-08 -0.7069108074548405
2010-01-11 -1.4624766212523337
2010-01-12 -0.9132777669357441
2010-01-13 0.16204588193577152
2010-01-14 0.958085568609925
2010-01-15 1.4683022129399834
2010-01-19 3.036016680985081
2010-01-20 2.2357911432637345
2010-01-21 2.8827438241030707
2010-01-22 -3.395977874791837
if th = 0.94
date active
2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 False
2010-01-11 False
2010-01-12 False
2010-01-13 False
2010-01-14 True
2010-01-15 True
2010-01-19 True
2010-01-20 True
2010-01-21 True
2010-01-22 False
Upvotes: 1
Views: 1020
Reputation: 294508
def alt_cond(s, th):
active = False
for x in s:
active = [x >= th, x > 0][int(active)]
yield active
df.assign(A=[*alt_cond(df.score, 0.94)])
date score A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
I used Numba to really speed things up. Still a loop but should be very fast if you can install numba
from numba import njit
@njit
def alt_cond(s, th):
active = False
out = np.zeros(len(s), dtype=np.bool8)
for i, x in enumerate(s):
if active:
if x <= 0:
active = False
else:
if x >= th:
active = True
out[i] = active
return out
df.assign(A=alt_cond(df.score.values, .94))
You can have a dictionary of column names and threshold values and iterate
th = {'score': 0.94}
df.join(pd.DataFrame(
np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
df.index, [f"{k}_A" for k in th]
))
date score score_A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
Upvotes: 2
Reputation: 957
I'm assuming your data is in a pandas dataframe, and 'date' is your index column. Then this would be the way I'd do it:
th = 0.94 # Threshold value
i = df[df.score>th].index[0] # Check the index for the first condition
df[i:][df.score<0].index[0] # Check the index for the second condition, after the index of the first condition
So use conditional indexing to find the index for the first condition ([df.score>th]
), then check for the second condition ([df.score<0]
), but begin to look from the index found for the first condition ([i:]
)
Upvotes: 2