Reputation: 3061
I have a data.table
containing a comparison between a curve representing real data dt$real
and another curve representing a lower-bound estimate dt$lower
of that data. The table contains:
dt$date
)dt$real
)dt$lower
)dt$isLocalMax
) or minimum (dt$isLocalMin
)The real data is very noisy, so I've used a heuristic to identify these "relevant" local maxima and minima, which is a small subset of all extrema.
I want to find the first point (per "cycle") where the estimator is underestimating the real data (i.e. where the real data is lower than estimated), but only if that datapoint comes after a local maximum.
I can trivially add an indentifier for when the estimator is underwater:
dt[, underwater := (real - lower < 0)
I can then create a run-identifer on underwater
:
dt[, uwRunID := rleid(underwater)]
I can then group by that ID and get the first row for each group:
dt[dt[underwater == TRUE, .I[1], by = uwRunID]$V1]
However, given the real data is noisy, it may move between "underwater" and "above water" multiple times before reaching the relevant minima. In such a case, I'd only want to select the first time it went underwater and discard every other instance, but the code above would return every dip underwater.
I considered adding another run-ID for the minima:
dt[, minRunID := rleid(isLocalMin)]
dt[dt[underwater == TRUE, .I[1], by = minRunID]$V1]
This actually eliminates that problem: it only collects the first underwater datapoint before each local minimum.
However, there's still another problem: if there's at least one more underwater point after the minimum, it'll also be collected. Since I only want values on the downhill, such points shouldn't be included.
So I've also created yet another runID for the maxima. However, no matter what I try, I can't figure out how to get it to work.
So, with the following data representing a single cycle, only one row should be returned:
dt <- data.table(date = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
real = c(1, 3, 4, 6, 3, 3, 1, 0, 1, 2, 5, 4, 6, 7, 5),
lower = c(0, 2, 3, 5, 4, 2, 2, 2, 2, 3, 4, 5, 5, 6, 4),
isLocalMax = c(F, F, F, T, F, F, F, F, F, F, F, F, F, T, F),
isLocalMin = c(F, F, F, F, F, F, F, T, F, F, F, F, F, F, F))
In summary, the conditions are:
dt$isLocalMax
and dt$isLocalMin
), identify the first (if any) point where the real data is lower than the estimated lower bound.date == 4
to the minimum at date == 8
, the first time the real value goes underwater is at date == 5
. It then goes back to positive at date == 6
before going underwater again at date == 7
. We only care about the first time it dips, so the only row which should be selected is date == 5
.date == 12
, but since that's on the uphill path from minimum to maximum, we don't care.Therefore, the expected output in this case is:
# date real lower
# 1: 5 3 4
Evidently, a larger dataset with more maxima and minima would return more than one row (assuming the real value ever goes underwater in any other cycles).
Upvotes: 4
Views: 130
Reputation: 17299
Hope I didn't misunderstand your purpose. Does this work for your data:
library(data.table)
# for each row, determine the row index of previous localMax
dt[, gmax := ave(seq_len(.N), cumsum(isLocalMax), FUN = function(x) x[1])]
# for each row, determine the row index of next localMin
dt[, gmin := ave(seq_len(.N), rev(cumsum(rev(isLocalMin))), FUN = function(x) x[length(x)])]
# filter rows and keep the first record for each gmax
dt[, .SD[gmin == gmin[1]], by = .(gmax)][ # these two lines locate
gmax < gmin & real < lower][ # max to min cycle and find where real < lower
!duplicated(gmax), .(date, real, lower)]
# results
# date real lower
# 1: 5 3 4
Upvotes: 1
Reputation: 5059
Here's one option that I think can be optimized, though I'm not entirely sure if it can generalize.
first_extrema <- dt[, .(min = which.max(isLocalMin), max = which.max(isLocalMax))]
if (first_extrema$min < first_extrema$max) {
dt[-(1:(first_extrema$min)),
c("min_rleid", "max_rleid") := lapply(.SD, rleid),
.SDcols = c("isLocalMin", "isLocalMax")]
} else {
dt[, c("min_rleid", "max_rleid") := lapply(.SD, rleid),
.SDcols = c("isLocalMin", "isLocalMax")]
}
dates_min <- dt[isLocalMin == TRUE, .(date = date[1L]), by = "min_rleid"][!is.na(min_rleid)]
dates_max <- dt[isLocalMax == TRUE, .(date = date[1L]), by = "max_rleid"][!is.na(max_rleid)]
downhill <- dates_min[dates_max, .(start = i.date, end = x.date), on = .(min_rleid == max_rleid)
][!is.na(end)]
dt[, join_date := date]
under <- dt[downhill,
.(underwater = date[which(real < lower)[1L]]),
on = .(join_date >= start, join_date <= end),
by = .EACHI]
ans <- dt[under, .(date, real, lower), on = .(date = underwater)]
ans
date real lower
1: 5 3 4
The code until downhill
is simply trying to find the start and end dates of a downhill period.
We first assign a pair of rleid
s based on isLocal*
.
These ids should be equal at the endpoints of a downhill period,
that's why we exclude some rows inside the first if
's branch if a local minimum appears first,
otherwise we would get uphill periods.
We then get the ids with minima/maxima and their dates (dates_min
and dates_max
),
and then join the tables based on said ids.
In this example, downhill
contains:
start end
1: 4 8
After that we use a non-equi join as a way of grouping each period
(by using by = .EACHI
),
and we can check each group and search for the first date where real < lower
.
If that never happens, which(real < lower)[1L]
should return NA
.
Finally we look for the remaining columns in dt
based on the dates contained in under
.
Let me know if it works for your actual data, I'm having trouble figuring out if there are edge cases.
Upvotes: 0