Reputation: 11
Let's say I have two columns, 'a' and 'b', of time series data in a Pandas dataframe. I want to create a third column that indicates if the difference between column 'a' at the current time period and column 'b' at any of the next 5 time periods increases by 8 or more before it decreases by 2 or more. Ideally I'd do this using some form of df.rolling(5).apply() and without any loops, but I keep running into challenges.
For demonstration's sake, I've written the logic out with loops but if anyone can give me some direction on how to do this more efficiently or elegantly, I'd really appreciate it. In reality the dataframe and the window will be much larger.
df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8,9,10], 'b':[1,0,9,0,15,0,20,15,23,6]})
df['c'] = 0
window = 5
positive_thresh = 8
negative_thresh = -2
num_rows = df.shape[0]
for a_idx in range(num_rows):
a_start = df.iloc[a_idx,0]
b_roll = df.iloc[(a_idx + 1):max(a_idx + 1 + window,num_rows), 1]
deltas = b_roll - a_start
positives = deltas[deltas>=positive_thresh]
negatives = deltas[deltas<=negative_thresh]
first_pos_idx = positives.index[0] if len(positives) > 0 else num_rows
first_neg_idx = negatives.index[0] if len(negatives) > 0 else num_rows
if first_pos_idx < first_neg_idx:
df.iloc[a_idx,2] = 1
print(df)
a b c
0 1 1 1
1 2 0 0
2 3 9 0
3 4 0 1
4 5 15 0
5 6 0 1
6 7 20 1
7 8 15 1
8 9 23 0
9 10 6 0
Upvotes: 1
Views: 166
Reputation: 3280
This is a mouthful to handle with masks only, but here is one approach:
from numpy.lib.stride_tricks import sliding_window_view
window = 5
n_rows = df.shape[0]
dfa = df.reindex(np.arange(df.shape[0] + window)) # Just so that the sliding view matches
b_roll = sliding_window_view(dfa.b, 5)[1:]
diff = (b_roll.T - df.a.values).T # diff next 5 "b" rows with current "a"
pos = (diff >= 8)
pos_idx = pos.argmax(1)
pos_idx[pos.sum(1) == 0] = n_rows # differ first idx vs. no occurences found
neg = (diff <= -2)
neg_idx = window - neg[:, ::-1].argmax(1) - 1 # getting the last occurence col-wise
neg_idx[neg.sum(1) == 0] = 0 # differ first idx vs. no occurences found
df["c"] = (pos_idx < neg_idx).astype(int)
If you'll notice, my suggested output does not quite match yours. I believe your snippet does not fully represent your description, but I might just have misunderstood something in the logic.
Output:
a b c
0 1 1 0
1 2 0 1
2 3 9 1
3 4 0 1
4 5 15 0
5 6 0 0
6 7 20 0
7 8 15 1
8 9 23 0
9 10 6 0
Upvotes: 1