Reputation: 546

Backtesting problem with Pandas and vectorization

I have this DataFrame:

df = pd.DataFrame({"val": [1, 2, 3, 5], "signal": [0, 1, 0, 0]})
df
    val signal
0   1   0
1   2   1
2   3   0
3   5   0

Then I do:

df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   0       5.0     True

This tells me whether my target has been hit.

Here's my issue: let's say that the starting DataFrame was this:

    val signal
0   1   0
1   2   1
2   3   0
3   5   1 # <-- new signal
4   6   0 # <-- new row

If I do the same operations as before I get:

df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   1       7.0     False
4   6   0       7.0     False

Now I have lost the hit at index 3, because the second signal replaced the target.

What I would like is for a signal not to create a new target while the previous target has not been hit yet.

Desired output (example 1):

    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   1       5.0     True
4   6   0       NaN     False

Desired output (example 2):

    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   1       5.0     False
3   5   0       5.0     True
4   6   0       NaN     False

Desired output (example 3):

    val signal  target  hit
0   1   1       4.0     False
1   4   0       4.0     True
2   3   0       NaN     False
3   4   1       7.0     False
4   7   0       7.0     True

Desired output (example 4):

    val signal  target  hit
0   5   0       NaN     False
1   3   1       6.0     False
2   6   1       6.0     True
3   2   1       5.0     False
4   7   0       5.0     True

P.S. Ideally, this needs to be done with vectorization as I'm going to perform this operation for millions of rows.

EDIT: Just so the logic is clearer, here's the "loopy" version of the algorithm:

def loopy_way(vals: list, signals: list) -> list:
    active_trgt = None
    hits = []
    for val, signal in zip(vals, signals):

        if active_trgt is not None:  # explicit check, so a target of 0 still counts as active
            if val >= active_trgt: # Arbitrary logic
                hits.append(True)
                active_trgt = None
                continue

            # There's an active target, so ignore signal
            hits.append(False)
            continue

        if signal:
            active_trgt = val + 3 # Arbitrary condition
            hits.append(False) # Couldn't be otherwise
            continue

        # No signal and no active target
        hits.append(False)
    return hits

Upvotes: 0

Answers (3)

hume

Reputation: 2553

You can look at both the new target and the previous target at each signal point using the .shift method in pandas.

Tracking both lets you flag a hit when the value has reached either the current or the previous target.

Additionally, you want to track the largest value you have seen in the previous signal window. You can enumerate signal windows with df.signal.cumsum() and then group by that enumeration to get the running maximum per signal window with df.groupby(df.signal_window).val.cummax().shift(1).
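As a minimal sketch of the window enumeration described above (using the third example frame, where the values are not monotonic), the two intermediate columns can be inspected on their own:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0]})

# Each signal starts a new window: the running count of signals labels the rows.
df["signal_window"] = df["signal"].cumsum()

# Running maximum inside each window, shifted down one row so that every row
# only sees values from earlier rows.
df["max_to_date"] = df.groupby("signal_window")["val"].cummax().shift(1)

print(df)
```

At the second signal (index 3), max_to_date holds the maximum of the first window, which is the value that gets compared against the previous target.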

As an additional condition for non-monotonic data, you can accept the candidate target in signal rows if it is less than the previous target.

Combining these, you can get your desired output.

I calculate and store these as intermediate columns below to show how the logic works, but you don't have to store and then drop them in your code.

Note: All of this said, it may not be worth vectorizing this calculation. Using numba or similar you could get a very fast implementation in a loop with more readable/maintainable code and substantial runtime memory savings since you don't have to do all your intermediate calculations for every row at once.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "val": [1, 2, 3, 5, 6], "signal": [0, 1, 0, 1, 0],
})

df2 = pd.DataFrame({
    "val": [1, 2, 3, 5, 6], "signal": [0, 1, 1, 0, 0],
})

df3 = pd.DataFrame({
    "val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0],
})

df4 = pd.DataFrame({
    "val": [5, 3, 6, 2, 7], "signal": [0, 1, 1, 1, 0],
})

for df in [df1, df2, df3, df4]:
    # add candidate target at signal times
    df["candidate_target"] = np.where(df.signal, df.val + 3, np.nan)
    
    # track previous target at signal times
    df["prev_target"] = np.where(
        df.signal,
        df.candidate_target.ffill().shift(1),
        np.nan
    )
    
    # enumerate the signal windows with cumsum
    df["signal_window"] = df.signal.cumsum()
    
    # track max value we have seen in previous signal window
    df["max_to_date"] = df.groupby(df.signal_window).val.cummax().shift(1)
    
    # for signal rows, actual target is candidate if previous has been exceeded, else previous
    df["signal_target"] = np.where(
        (df.max_to_date >= df.prev_target) | df.prev_target.isnull() | (df.prev_target > df.candidate_target),
        df.candidate_target,
        df.prev_target
    )
    
    # for non-signal rows, add target only if it has not been hit
    df["non_signal_target"] = np.where(
        (df.signal == 0) & (df.max_to_date < df.signal_target.ffill()),
        df.signal_target.ffill(),
        np.nan,
    )
    
    # combine signal target and non-signal target rows
    df["target"] = df.signal_target.fillna(df.non_signal_target)
    
    # hit is where value is greater than or equal to target
    df["hit"] = df.val >= df.target
    
    # drop intermediate calculations
    df.drop(["max_to_date", "signal_target", "signal_window", "non_signal_target", "candidate_target", "prev_target"], axis=1, inplace=True)
    
    print(df)
#>    val  signal  target    hit
#> 0    1       0     NaN  False
#> 1    2       1     5.0  False
#> 2    3       0     5.0  False
#> 3    5       1     5.0   True
#> 4    6       0     NaN  False
#>    val  signal  target    hit
#> 0    1       0     NaN  False
#> 1    2       1     5.0  False
#> 2    3       1     5.0  False
#> 3    5       0     5.0   True
#> 4    6       0     NaN  False
#>    val  signal  target    hit
#> 0    1       1     4.0  False
#> 1    4       0     4.0   True
#> 2    3       0     NaN  False
#> 3    4       1     7.0  False
#> 4    7       0     7.0   True
#>    val  signal  target    hit
#> 0    5       0     NaN  False
#> 1    3       1     6.0  False
#> 2    6       1     6.0   True
#> 3    2       1     5.0  False
#> 4    7       0     5.0   True
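For comparison with the note above about not vectorizing at all, here is a minimal sketch of the loop alternative. It follows the asker's "loopy" logic, also records the target, and works on plain NumPy arrays so that a JIT compiler such as numba could in principle be applied to it (that step is not shown or tested here). The name targets_and_hits and the offset parameter are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd


def targets_and_hits(vals, signals, offset=3.0):
    """Single pass over the data; NaN in `active` means "no active target"."""
    n = len(vals)
    targets = np.full(n, np.nan)
    hits = np.zeros(n, dtype=bool)
    active = np.nan
    for i in range(n):
        if not np.isnan(active):
            # A target is active: report it and ignore any new signal.
            targets[i] = active
            if vals[i] >= active:
                hits[i] = True
                active = np.nan  # the target expires once it is hit
        elif signals[i]:
            # No active target and a signal: open a new target.
            active = vals[i] + offset
            targets[i] = active
    return targets, hits


df = pd.DataFrame({"val": [5, 3, 6, 2, 7], "signal": [0, 1, 1, 1, 0]})
df["target"], df["hit"] = targets_and_hits(
    df["val"].to_numpy(), df["signal"].to_numpy()
)
print(df)
```

The loop keeps only one scalar of state, so memory use stays at the two output arrays regardless of how many intermediate conditions the rule needs.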

Upvotes: 1

Bill

Reputation: 11633

If I understand correctly, this is the logic you want to implement:

import numpy as np
import pandas as pd


def transition(value, signal, prev_target, prev_hit):
    """Calculate target and hit in current time step"""
    if prev_hit:
        # A hit on the previous step clears the active target
        prev_target = np.nan
    if signal == 1:
        new_target = value + 3
        target = new_target if np.isnan(prev_target) else min(prev_target, new_target)
    else:
        target = prev_target
    # Comparing against NaN is False, so no target means no hit
    hit = bool(value >= target)
    return target, hit

(PLEASE CONFIRM)

This works on the examples you provided so far (ignoring some values which I think are errors in your examples).

For example:

# Example data 3
df = pd.DataFrame({"val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0]})

# Prepare empty columns
df['target'] = None
df['hit'] = False
# Initial assumptions
target, hit = (np.nan, False)
for i, row in df.iterrows():
    target, hit = transition(row.val, row.signal, target, hit)
    df.loc[i, ['target', 'hit']] = target, hit
print(df)

Produces:

   val  signal target    hit
0    1       1      4  False
1    4       0      4   True
2    3       0    NaN  False
3    4       1      7  False
4    7       0      7   True

However, I think this requires a recursive solution because a signal 'expires' after a hit. If I'm right, then I don't think this is vectorizable.
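Since each step depends on the result of the previous one, a hedged alternative to iterrows with per-row df.loc assignment is to iterate over the underlying arrays and assign the columns once at the end. This is only a sketch, assuming the transition rule above is the intended one; the function is restated so the block runs on its own, and unlike the snippet above it produces a float target column rather than an object one.

```python
import numpy as np
import pandas as pd


def transition(value, signal, prev_target, prev_hit):
    """Same rule as the transition function above, restated for a standalone run."""
    if prev_hit:
        prev_target = np.nan
    if signal == 1:
        new_target = value + 3
        target = new_target if np.isnan(prev_target) else min(prev_target, new_target)
    else:
        target = prev_target
    return target, bool(value >= target)


df = pd.DataFrame({"val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0]})

targets, hits = [], []
target, hit = np.nan, False  # initial state: no target, no hit
for value, signal in zip(df["val"].to_numpy(), df["signal"].to_numpy()):
    target, hit = transition(value, signal, target, hit)
    targets.append(target)
    hits.append(hit)

# Assign whole columns once instead of writing to the frame row by row.
df["target"] = targets
df["hit"] = hits
print(df)
```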

Upvotes: 0

Bill

Reputation: 11633

I think the difficulty here comes from the fact that the triggers are all in one column.

To make things easier, it's always best to organize all the data needed for a conditional test into one row.

To do this here we have to think about what value we need to test for a hit for each signal.

Here I calculated the 'minimum future value below the current row'. I did this by running the min function as an accumulator from the end to the beginning of the df.val column.

# Example data 1
df = pd.DataFrame({"val": [1, 2, 3, 5, 6], "signal": [0, 1, 0, 0, 0]})

from itertools import accumulate

# Calculate minimum future values
df['mf_val'] = np.fromiter(accumulate(df.val.values[::-1], min), dtype=int)[::-1]

df['hit'] = (df['val'] + 3 >= df['mf_val']).where(df.signal.astype(bool), False)

print(df)

Output in example 1:

   val  signal  mf_val    hit
0    1       0       1  False
1    2       1       2   True
2    3       0       3  False
3    5       0       5  False
4    6       0       6  False

Output in example 2:

   val  signal  mf_val    hit
0    1       0       1  False
1    2       1       2   True
2    3       1       3   True
3    5       0       5  False
4    6       0       6  False

Output in example 3:

   val  signal  mf_val    hit
0    1       1       1   True
1    4       0       3  False
2    3       0       3  False
3    4       1       4   True
4    7       0       7  False

This is not exactly the same as your desired values because it shows all hits and the hits are indicated in the same row as the corresponding signal. But at least it doesn't 'erase' the first hit. If you only want the first hit, use df.hit.tolist().index(True).
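The reverse running minimum used for mf_val can also be computed without itertools. The following sketch, run on the values of example 3, checks that NumPy's np.minimum.accumulate on the reversed array gives the same result as the accumulate call above; the variable names are illustrative.

```python
from itertools import accumulate

import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0]})
vals = df["val"].to_numpy()

# itertools version from above: running min over the reversed values, reversed back.
mf_itertools = np.fromiter(accumulate(vals[::-1], min), dtype=int)[::-1]

# NumPy-only version: the ufunc's accumulate method performs the same scan.
mf_numpy = np.minimum.accumulate(vals[::-1])[::-1]

print(mf_numpy)
print(np.array_equal(mf_itertools, mf_numpy))
```

The NumPy version avoids a Python-level call per element, which should matter at the millions-of-rows scale mentioned in the question.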

UPDATE

I think this does what you want:

# Example data 4
df = pd.DataFrame({"val": [5, 3, 4, 2, 7], "signal": [0, 1, 1, 1, 0]})

df['target'] = np.minimum.accumulate((df.val + 3).where(df.signal.astype(bool), np.inf))
df['hit'] = df.val >= df.target
print(df)

Output in example 4:

   val  signal  target    hit
0    5       0     inf  False
1    3       1     6.0  False
2    4       1     6.0  False
3    2       1     5.0  False
4    7       0     5.0   True

Upvotes: 0
