Reputation: 75
I want to find all runs in a data vector where the mean value is below some threshold. E.g. for the dataset
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
If I wanted to find all runs with a mean value at or below 0.20, the run covering positions 1-6 would not be identified (mean 0.205), but the run 1-7 (mean 0.193) would be, among others.
To make things simpler, I don't care about subsets of runs whose mean is already known to be under the threshold. I.e., following the example, I would not have needed to check run 1-6 if I already knew run 1-7 was below the threshold. But I would still need to check other runs that are not subsets of it (e.g. 2-8).
In an attempt to answer this question, I see that I could start with something like this:
hour <- c(1, 2, 3, 4, 5, 6, 7, 8)
value <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
d <- data.frame(hour, value)
rng <- rev(1:length(d$value))
# add one column of right-aligned rolling means per window width (MA8 down to MA1)
data.table::setDT(d)[, paste0('MA', rng) := lapply(rng, function(x)
  zoo::rollmeanr(value, x, fill = NA))][]
And then search through all the generated columns for values under the threshold.
But that method is not very efficient for what I want to achieve (it also examines all subsets of runs already identified as being under the threshold) and doesn't cope well with large datasets (about 500k entries, which would mean a 500k x 500k matrix).
Instead it would suffice to record the indices of runs under the threshold in a separate variable. That would at least avoid creating a 500k x 500k matrix. But I'm not sure how to check whether the output of rollmeanr() is under a value and, if so, how to get the relevant indices.
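For a single window width, this is roughly the kind of check I have in mind (just a sketch; the width k = 3 is an arbitrary placeholder):
library(zoo)
threshold <- 0.20
k <- 3                                  # hypothetical window width
ma <- rollmeanr(value, k, fill = NA)    # right-aligned rolling means
ends <- which(ma <= threshold)          # end index of each qualifying window
runs <- data.frame(begin = ends - k + 1, end = ends)
runs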
Upvotes: 4
Views: 118
Reputation: 5704
First, note that mean(x) <= threshold if and only if sum(x - threshold) <= 0.
Secondly, finding the runs of d - threshold with nonpositive sum is equivalent to finding the pairs of entries of s <- c(0, cumsum(d - threshold)) whose second value is less than or equal to the first, since the sum of the run from position i to j equals s[j + 1] - s[i].
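A quick sanity check of that equivalence on the question's data (a sketch; threshold <- 0.20 is assumed):
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20
s <- c(0, cumsum(d - threshold))
# run 1-7 has mean 0.193 <= 0.20, and indeed its sum over d - threshold is s[8] - s[1] <= 0
s[8] <= s[1]                                             # TRUE
isTRUE(all.equal(sum(d[1:7] - threshold), s[8] - s[1]))  # TRUE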
Hence:
# d is the numeric data vector and threshold the cutoff (0.20 in the question)
s <- c(0, cumsum(d - threshold))
# potential start points of *maximal* runs:
B <- which(!duplicated(cummax(s)))
# potential end points:
E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))
# end point associated with each start point
# (= for each start point in B, we find the *last* point of E whose value of s is not larger)
E2 <- E[findInterval(s[B], s[E])] - 1
# potential maximal runs:
df <- data.frame(begin = B, end = E2)
# now we just have to filter out lines with begin > end, and keep only the
# first begin for each end - for instance using dplyr:
library(dplyr)
df %>%
  filter(begin <= end) %>%
  group_by(end) %>%
  summarise(begin = min(begin))
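Putting the steps together as one function and applying it to the question's data (a sketch; the function name maximal_runs_below is just for illustration, and 0.20 is the question's threshold):
library(dplyr)

maximal_runs_below <- function(d, threshold) {
  s <- c(0, cumsum(d - threshold))
  B <- which(!duplicated(cummax(s)))
  E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))
  E2 <- E[findInterval(s[B], s[E])] - 1
  data.frame(begin = B, end = E2) %>%
    filter(begin <= end) %>%
    group_by(end) %>%
    summarise(begin = min(begin))
}

d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
maximal_runs_below(d, 0.20)
# expected: a single maximal run with begin = 1 and end = 8,
# since the whole vector already has mean 0.179 <= 0.20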
Upvotes: 3