Complex selection of data.table rows

Question

I have a data.table containing a comparison between a curve representing real data dt$real and another curve representing a lower-bound estimate dt$lower of that data. The table contains:

The date of each datapoint (dt$date)
The real value on that date (dt$real)
The value of the lower bound on that date (dt$lower)
Whether that value is a relevant local maximum (dt$isLocalMax) or minimum (dt$isLocalMin)

The real data is very noisy, so I've used a heuristic to identify these "relevant" local maxima and minima, which is a small subset of all extrema.

I want to find the first point (per "cycle") where the lower-bound estimate fails (i.e. where the real data is lower than the estimate), but only if that datapoint comes after a local maximum.

I can trivially add an identifier for when the estimator is underwater:

dt[, underwater := (real - lower < 0)]

I can then create a run identifier on underwater:

dt[, uwRunID := rleid(underwater)]

I can then group by that ID and get the first row for each group:

dt[dt[underwater == TRUE, .I[1], by = uwRunID]$V1]

However, given the real data is noisy, it may move between "underwater" and "above water" multiple times before reaching the relevant minimum. In such a case, I'd only want to select the first time it went underwater and discard every other instance, but the code above would return every dip underwater.
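To make the problem concrete, here is a minimal sketch that runs the run-ID approach on the sample data from this question (only the date, real and lower columns are needed here). It shows that each separate dip contributes its own row, including the one on the uphill path. The name firstDips is just an illustrative variable.

```r
library(data.table)

# Sample data from the question (extrema columns omitted, not needed here)
dt <- data.table(date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 real  = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2,  5,  4,  6,  7,  5),
                 lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3,  4,  5,  5,  6,  4))

# Flag underwater rows and number the consecutive runs of the flag
dt[, underwater := real < lower]
dt[, uwRunID := rleid(underwater)]

# First row of every underwater run
firstDips <- dt[dt[underwater == TRUE, .I[1], by = uwRunID]$V1]
firstDips$date
# [1]  5  7 12
```

Dates 7 and 12 are the unwanted rows: the second dip on the same downhill, and a dip on the uphill.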

I considered adding another run-ID for the minima:

dt[, minRunID := rleid(isLocalMin)]
dt[dt[underwater == TRUE, .I[1], by = minRunID]$V1]

This actually eliminates that problem: it only collects the first underwater datapoint before each local minimum.

However, there's still another problem: if there's at least one more underwater point after the minimum, it'll also be collected. Since I only want values on the downhill, such points shouldn't be included.
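As a quick check of this behaviour, the sketch below applies the minRunID idea to the sample data from this question. Note that rleid(isLocalMin) puts the minimum row into a run of its own, so on this data the minimum itself is also returned, along with the first underwater point after it. The name picked is an illustrative variable only.

```r
library(data.table)

# Sample data from the question (only the columns used here)
dt <- data.table(date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 real  = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2,  5,  4,  6,  7,  5),
                 lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3,  4,  5,  5,  6,  4),
                 isLocalMin = seq_len(15) %in% 8)

dt[, underwater := real < lower]

# Runs of isLocalMin: rows before the minimum, the minimum row, rows after it
dt[, minRunID := rleid(isLocalMin)]

# First underwater row within each of those runs
picked <- dt[dt[underwater == TRUE, .I[1], by = minRunID]$V1]
picked$date
# [1] 5 8 9
```

Only date 5 is wanted; dates 8 and 9 come from the run holding the minimum and the run after it.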

So I've also created yet another runID for the maxima. However, no matter what I try, I can't figure out how to get it to work.

So, with the following data representing a single cycle, only one row should be returned:

dt <- data.table(date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 real  = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2,  5,  4,  6,  7,  5),
                 lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3,  4,  5,  5,  6,  4),
                 isLocalMax = c(F, F, F, T, F, F, F, F, F, F, F, F, F, T, F),
                 isLocalMin = c(F, F, F, F, F, F, F, T, F, F, F, F, F, F, F))

In summary, the conditions are:

For each "local maximum to minimum" cycle in the real data (where the maxima and minima are defined by dt$isLocalMax and dt$isLocalMin), identify the first (if any) point where the real data is lower than the estimated lower bound.
If, on the downhill path from maximum to minimum, the real data dips below the lower bound, and then rises above it, and then dips below it (repeated an arbitrary number of times), only the first row from the time it dipped below the lower bound that cycle should be considered. In the sample data above, on the downhill path from the maximum at date == 4 to the minimum at date == 8, the first time the real value goes underwater is at date == 5. It then goes back to positive at date == 6 before going underwater again at date == 7. We only care about the first time it dips, so the only row which should be selected is date == 5.
If there are any "underwater" segments on the uphill path from minimum to maximum, these should be ignored. In the sample data above, the real value goes underwater at date == 12, but since that's on the uphill path from minimum to maximum, we don't care.

Therefore, the expected output in this case is:

#    date real lower
# 1:    5    3     4

Evidently, a larger dataset with more maxima and minima would return more than one row (assuming the real value ever goes underwater in any other cycles).

mt1022 · Accepted Answer

Hope I didn't misunderstand your purpose. Does this work for your data:

library(data.table)
# for each row, determine the row index of previous localMax
dt[, gmax := ave(seq_len(.N), cumsum(isLocalMax), FUN = function(x) x[1])]
# for each row, determine the row index of next localMin
dt[, gmin := ave(seq_len(.N), rev(cumsum(rev(isLocalMin))), FUN = function(x) x[length(x)])]
# filter rows and keep the first record for each gmax
dt[, .SD[gmin == gmin[1]], by = .(gmax)][   # these two lines locate
    gmax < gmin & real < lower][            # max to min cycle and find where real < lower
        !duplicated(gmax), .(date, real, lower)]

# results
#    date real lower
# 1:    5    3     4
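
For comparison, here is an alternative sketch (not the method posted above) that labels each row with a segment ID and a downhill flag, then keeps the first underwater row per downhill segment. It rests on assumptions that should be checked against real data: maxima and minima alternate, and a downhill segment runs from the maximum row through the minimum row inclusive. The column names segID and downhill and the variable res are hypothetical names introduced for illustration.

```r
library(data.table)

# Sample data from the question
dt <- data.table(date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 real  = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2,  5,  4,  6,  7,  5),
                 lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3,  4,  5,  5,  6,  4),
                 isLocalMax = seq_len(15) %in% c(4, 14),
                 isLocalMin = seq_len(15) %in% 8)

# A new segment starts at every maximum row and at the row after every minimum
# (assumption: maxima and minima alternate)
dt[, segID := cumsum(isLocalMax) + cumsum(shift(isLocalMin, fill = FALSE))]

# A segment is downhill when its first row is a local maximum
dt[, downhill := isLocalMax[1], by = segID]

# First underwater row of each downhill segment
res <- dt[downhill & real < lower][!duplicated(segID), .(date, real, lower)]
res
#    date real lower
# 1:    5    3     4
```

Rows before the first maximum fall into a segment that does not start with a maximum, so they are never treated as downhill.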
