Wasabi
Wasabi

Reputation: 3061

Complex selection of data.table rows

I have a data.table containing a comparison between a curve representing real data dt$real and another curve representing a lower-bound estimate dt$lower of that data. The table contains:

The real data is very noisy, so I've used a heuristic to identify these "relevant" local maxima and minima, which is a small subset of all extrema.

I want to find the first point (per "cycle") where the estimator is underestimating the real data (i.e. where the real data is lower than estimated), but only if that datapoint comes after a local maximum.

I can trivially add an indentifier for when the estimator is underwater:

dt[, underwater := (real - lower < 0)

I can then create a run-identifer on underwater:

dt[, uwRunID := rleid(underwater)]

I can then group by that ID and get the first row for each group:

dt[dt[underwater == TRUE, .I[1], by = uwRunID]$V1]

However, given the real data is noisy, it may move between "underwater" and "above water" multiple times before reaching the relevant minima. In such a case, I'd only want to select the first time it went underwater and discard every other instance, but the code above would return every dip underwater.

I considered adding another run-ID for the minima:

dt[, minRunID := rleid(isLocalMin)]
dt[dt[underwater == TRUE, .I[1], by = minRunID]$V1]

This actually eliminates that problem: it only collects the first underwater datapoint before each local minimum.

However, there's still another problem: if there's at least one more underwater point after the minimum, it'll also be collected. Since I only want values on the downhill, such points shouldn't be included.

So I've also created yet another runID for the maxima. However, no matter what I try, I can't figure out how to get it to work.

So, with the following data representing a single cycle, only one row should be returned:

dt <- data.table(date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 real  = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2,  5,  4,  6,  7,  5),
                 lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3,  4,  5,  5,  6,  4),
                 isLocalMax = c(F, F, F, T, F, F, F, F, F, F, F, F, F, T, F),
                 isLocalMin = c(F, F, F, F, F, F, F, T, F, F, F, F, F, F, F))

enter image description here

In summary, the conditions are:

Therefore, the expected output in this case is:

#    date real lower
# 1:    5    3     4

Evidently, a larger dataset with more maxima and minima would return more than one row (assuming the real value ever goes underwater in any other cycles).

Upvotes: 4

Views: 130

Answers (2)

mt1022
mt1022

Reputation: 17299

Hope I didn't misunderstand your purpose. Does this work for your data:

library(data.table)
# for each row, determine the row index of previous localMax
dt[, gmax := ave(seq_len(.N), cumsum(isLocalMax), FUN = function(x) x[1])]
# for each row, determine the row index of next localMin
dt[, gmin := ave(seq_len(.N), rev(cumsum(rev(isLocalMin))), FUN = function(x) x[length(x)])]
# filter rows and keep the first record for each gmax
dt[, .SD[gmin == gmin[1]], by = .(gmax)][   # these two lines locate
    gmax < gmin & real < lower][            # max to min cycle and find where real < lower
        !duplicated(gmax), .(date, real, lower)]

# results
#    date real lower
# 1:    5    3     4

Upvotes: 1

Alexis
Alexis

Reputation: 5059

Here's one option that I think can be optimized, though I'm not entirely sure if it can generalize.

first_extrema <- dt[, .(min = which.max(isLocalMin), max = which.max(isLocalMax))]

if (first_extrema$min < first_extrema$max) {
  dt[-(1:(first_extrema$min)),
     c("min_rleid", "max_rleid") := lapply(.SD, rleid),
     .SDcols = c("isLocalMin", "isLocalMax")]
} else {
  dt[, c("min_rleid", "max_rleid") := lapply(.SD, rleid),
     .SDcols = c("isLocalMin", "isLocalMax")]
}

dates_min <- dt[isLocalMin == TRUE, .(date = date[1L]), by = "min_rleid"][!is.na(min_rleid)]
dates_max <- dt[isLocalMax == TRUE, .(date = date[1L]), by = "max_rleid"][!is.na(max_rleid)]

downhill <- dates_min[dates_max, .(start = i.date, end = x.date), on = .(min_rleid == max_rleid)
                      ][!is.na(end)]

dt[, join_date := date]
under <- dt[downhill,
            .(underwater = date[which(real < lower)[1L]]),
            on = .(join_date >= start, join_date <= end),
            by = .EACHI]

ans <- dt[under, .(date, real, lower), on = .(date = underwater)]
ans
   date real lower
1:    5    3     4

The code until downhill is simply trying to find the start and end dates of a downhill period. We first assign a pair of rleids based on isLocal*. These ids should be equal at the endpoints of a downhill period, that's why we exclude some rows inside the first if's branch if a local minimum appears first, otherwise we would get uphill periods. We then get the ids with minima/maxima and their dates (dates_min and dates_max), and then join the tables based on said ids. In this example, downhill contains:

   start end
1:     4   8

After that we use a non-equi join as a way of grouping each period (by using by = .EACHI), and we can check each group and search for the first date where real < lower. If that never happens, which(real < lower)[1L] should return NA.

Finally we look for the remaining columns in dt based on the dates contained in under.

Let me know if it works for your actual data, I'm having trouble figuring out if there are edge cases.

Upvotes: 0

Related Questions