Manselpotamus

Reputation: 75

Find rolling averages of any length under threshold

I want to find all runs in a data vector where the mean value is below some threshold. E.g. for the dataset

d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)

If I wanted to find all runs with a mean value less than or equal to 0.20, the one-indexed run 1-6 would not be identified (mean 0.205) but 1-7 (mean 0.193) would, among others.
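The two means quoted above can be checked directly. This is only a quick sketch; `threshold` is a name introduced here for illustration, and `d[1:6]` uses R's 1-based positions.

```r
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20  # illustrative name for the cutoff

mean(d[1:6])               # about 0.205
mean(d[1:7])               # about 0.193
mean(d[1:6]) <= threshold  # run 1-6 does not qualify
mean(d[1:7]) <= threshold  # run 1-7 qualifies
```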

To make things simpler, I don't care about subsets of runs whose mean is already known to be under the threshold. I.e., following the example, I would not need to check run 1-6 if I already knew 1-7 was below the threshold. But I would still need to check other runs that overlap run 1-7 and are not a subset of it (e.g. 2-8).

In an attempt to answer this question, I see that I could start with something similar to this e.g.

hour <- c(1, 2, 3, 4, 5, 6, 7, 8)
value <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
d <- data.frame(hour, value)

rng <- rev(1:length(d$value))

data.table::setDT(d)[, paste0('MA', rng) := lapply(rng, function(x) 
    zoo::rollmeanr(value, x, fill = NA))][]

And then search through all the generated columns for values under the threshold.

But that method is not very efficient for what I want to achieve (it looks into all subsets of runs that are already identified as under the threshold) and doesn't scale well to large datasets (meaning about 500k entries, for which I would end up with a 500k x 500k matrix).

Instead it would suffice to record the indices of runs under the threshold in a separate variable. This would at least avoid creating a 500k x 500k matrix. But I'm not sure how to check if the output of rollmeanr() is under a value and if so get the relevant indices.

Upvotes: 4

Views: 118

Answers (1)

Scarabee

Reputation: 5704

First, note that mean(x) <= threshold if and only if sum(x - threshold) <= 0.
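The reason is that sum(x - threshold) equals length(x) * (mean(x) - threshold), and length(x) is positive, so both quantities have the same sign. A small numerical illustration follows; the vector `x` (the first seven values from the question) and the names `lhs` and `rhs` are chosen for the example only.

```r
x <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12)
threshold <- 0.20

# sum(x - threshold) and length(x) * (mean(x) - threshold) agree up to rounding
lhs <- sum(x - threshold)
rhs <- length(x) * (mean(x) - threshold)
all.equal(lhs, rhs)

# hence both tests give the same answer
# (exact ties may be affected by floating-point rounding)
mean(x) <= threshold
lhs <= 0
```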

Secondly, finding the runs of d - threshold with nonpositive sum is equivalent to finding the pairs of positions in c(0, cumsum(d - threshold)) where the value at the second position is less than or equal to the value at the first.
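To make this concrete: for a vector x and s <- c(0, cumsum(x)), the sum of x[i:j] is s[j + 1] - s[i], so the run i..j has a nonpositive sum exactly when s[j + 1] <= s[i]. A minimal sketch, assuming the question's data and a threshold of 0.20 (the names `x`, `i`, `j`, `run_sum` and `diff_s` are illustrative):

```r
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20

x <- d - threshold
s <- c(0, cumsum(x))    # one element longer than x

i <- 2
j <- 8
run_sum <- sum(x[i:j])      # sum of the shifted values over the run i..j
diff_s  <- s[j + 1] - s[i]  # same quantity from the cumulative sums
all.equal(run_sum, diff_s)

s[j + 1] <= s[i]            # the run 2-8 has mean <= threshold
```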

Hence:

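# d is the numeric vector from the question; threshold is the cutoff (0.20 in the example)
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20
s <- c(0, cumsum(d - threshold))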

# potential start points of *maximal* runs:
B <- which(!duplicated(cummax(s)))
# potential end points:
E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))

# end point associated with each start point
# (= for each point of B, we find the *last* point of E whose value of s is
# less than or equal to the value of s at that start point)
E2 <- E[findInterval(s[B], s[E])] - 1

# potential maximal runs:
df <- data.frame(begin = B, end = E2)

# now we just have to filter out lines with begin > end, and keep only the 
# first begin for each end - for instance using dplyr:
library(dplyr)
df %>%
  filter(begin <= end) %>%
  group_by(end) %>%
  summarise(begin = min(begin))
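
For convenience, the steps above can be wrapped in a function. The following is a sketch under stated assumptions, not a tested library routine: `maximal_runs` is a hypothetical helper name, and the final dplyr step is swapped for base R's `!duplicated()`, which keeps the first (smallest) begin for each end because B is already in increasing order.

```r
# Hypothetical helper: maximal runs of d whose mean is <= threshold.
# Returns a data frame of 1-based begin/end positions in d.
maximal_runs <- function(d, threshold) {
  s <- c(0, cumsum(d - threshold))

  # candidate start points: positions where s reaches a new running maximum
  B <- which(!duplicated(cummax(s)))
  # candidate end points: last position of each distinct suffix minimum
  E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))

  # for each start, the last candidate end whose s value is <= s at the start;
  # subtracting 1 converts a position in s to a position in d
  E2 <- E[findInterval(s[B], s[E])] - 1

  runs <- data.frame(begin = B, end = E2)
  runs <- runs[runs$begin <= runs$end, ]  # drop empty runs
  runs <- runs[!duplicated(runs$end), ]   # keep the smallest begin per end
  rownames(runs) <- NULL
  runs
}

d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
res <- maximal_runs(d, threshold = 0.20)
res
```

On the question's data the mean of the whole vector is already below 0.20, so the single maximal run is the full range 1-8. Because the comparison relies on floating-point cumulative sums, runs whose mean equals the threshold exactly may be sensitive to rounding.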

Upvotes: 3
