a_leemo
a_leemo

Reputation: 71

R - Subset rows matching dissimilar start & end conditions

I have some timeseries data where I'm interested in capturing the consecutive periods that the data drops below a value x until the value goes above y (where y > x). The challenge is that the values might go above and below x several times during this period.

There's likely to be several of these consecutive, non-overlapping periods in a given dataset.

A basic example:

row     timestamp         value
1   2018-01-11 11:23:56   49.829
2   2018-01-11 11:24:00   49.803
3   2018-01-11 11:24:04   49.793
4   2018-01-11 11:24:08   49.813
5   2018-01-11 11:24:11   49.844
6   2018-01-11 11:24:15   49.830
7   2018-01-11 11:24:19   49.792
8   2018-01-11 11:24:23   49.777
9   2018-01-11 11:24:27   49.810
10  2018-01-11 11:24:31   49.843
11  2018-01-11 11:24:35   49.867
12  2018-01-11 11:24:39   49.913
13  2018-01-11 11:24:43   49.925

So in the above example my result would be rows 3-12. I'd like to exclude overlapping period of rows 7-12 for example.

I've played around a lot and struggling to get anything to work. The most logical approach seems to be to establish a counter that initiates when the values drop below 49.8 and doesn't stop until the value is above 49.9. But I'm not sure how to implement this.

Any help greatly appreciated!

Upvotes: 2

Views: 144

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270428

We can encode the implied finite state machine into a regular expression. This uses no packages, has straight forward logic and runs quickly (see benchmark).

Using the inputs defined reproducibly in the Note at the end (we have expanded the question's input to have two stretches) create an indicator, ind, that is

  • 0 for less than x,
  • 1 if equal to or greater than x but less than y and
  • 2 for greater than or equal to y.

Then convert that to a character string and use gregexpr to find the stretches from 0 consisting of 0's and 1's.

ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
if (min(g) == -1) g <- c()
res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length"))

giving:

res
##   start end
## 1     3  12
## 2    16  25

The question did not specify the form of output so if you wanted a 0/1 vector instead then this will convert the above output to such a vector:

with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
## [1] 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0

Benchmark

If the data is large and performance is of concern then these run very fast.

library(microbenchmark)
library(purrr)
library(dplyr)

microbenchmark(
regex = {
  ind <- (DF2$value >= x) + (DF2$value >= y)
  g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
  data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length")) 
},
regex2 = {
  ind <- (DF2$value >= x) + (DF2$value >= y)
  g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
  res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length")) 
  with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
},
accum = {
  DF2 %>%
  mutate(counter = accumulate(value,.init = FALSE, 
                              ~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])
})

## Unit: milliseconds
##    expr    min      lq      mean   median       uq     max neval cld
##   regex 1.0019 1.10215  1.209319  1.19970  1.24970  2.3949   100 a  
##  regex2 1.3651 1.46005  1.599009  1.54315  1.65880  2.8078   100  b 
##   accum 8.8840 9.95140 10.492953 10.34490 10.86335 13.5756   100   c

Note

We have expanded the input so that there are two stretches to be matched.

DF <- structure(list(row = 1:13, timestamp = c("2018-01-11 11:23:56", 
"2018-01-11 11:24:00", "2018-01-11 11:24:04", "2018-01-11 11:24:08", 
"2018-01-11 11:24:11", "2018-01-11 11:24:15", "2018-01-11 11:24:19", 
"2018-01-11 11:24:23", "2018-01-11 11:24:27", "2018-01-11 11:24:31", 
"2018-01-11 11:24:35", "2018-01-11 11:24:39", "2018-01-11 11:24:43"
), value = c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792, 
49.777, 49.81, 49.843, 49.867, 49.913, 49.925)), 
class = "data.frame", row.names = c(NA, -13L))

DF2 <- rbind(DF, DF)

x <- 49.8
y <- 49.9

DF2 just defined is just two copies of DF. Note that we don't use row and timestamp above so we focus on value.

> DF2
   row           timestamp  value
1    1 2018-01-11 11:23:56 49.829
2    2 2018-01-11 11:24:00 49.803
3    3 2018-01-11 11:24:04 49.793
4    4 2018-01-11 11:24:08 49.813
5    5 2018-01-11 11:24:11 49.844
6    6 2018-01-11 11:24:15 49.830
7    7 2018-01-11 11:24:19 49.792
8    8 2018-01-11 11:24:23 49.777
9    9 2018-01-11 11:24:27 49.810
10  10 2018-01-11 11:24:31 49.843
11  11 2018-01-11 11:24:35 49.867
12  12 2018-01-11 11:24:39 49.913
13  13 2018-01-11 11:24:43 49.925
14   1 2018-01-11 11:23:56 49.829
15   2 2018-01-11 11:24:00 49.803
16   3 2018-01-11 11:24:04 49.793
17   4 2018-01-11 11:24:08 49.813
18   5 2018-01-11 11:24:11 49.844
19   6 2018-01-11 11:24:15 49.830
20   7 2018-01-11 11:24:19 49.792
21   8 2018-01-11 11:24:23 49.777
22   9 2018-01-11 11:24:27 49.810
23  10 2018-01-11 11:24:31 49.843
24  11 2018-01-11 11:24:35 49.867
25  12 2018-01-11 11:24:39 49.913
26  13 2018-01-11 11:24:43 49.925

Upvotes: 3

AnilGoyal
AnilGoyal

Reputation: 26238

You may also use purrr::accumulate Demo on the sample data included by @G.Grothendieck,

DF2 %>%
  mutate(counter = accumulate(value,.init = FALSE, 
                              ~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])

   row           timestamp  value counter
1    1 2018-01-11 11:23:56 49.829   FALSE
2    2 2018-01-11 11:24:00 49.803   FALSE
3    3 2018-01-11 11:24:04 49.793    TRUE
4    4 2018-01-11 11:24:08 49.813    TRUE
5    5 2018-01-11 11:24:11 49.844    TRUE
6    6 2018-01-11 11:24:15 49.830    TRUE
7    7 2018-01-11 11:24:19 49.792    TRUE
8    8 2018-01-11 11:24:23 49.777    TRUE
9    9 2018-01-11 11:24:27 49.810    TRUE
10  10 2018-01-11 11:24:31 49.843    TRUE
11  11 2018-01-11 11:24:35 49.867    TRUE
12  12 2018-01-11 11:24:39 49.913   FALSE
13  13 2018-01-11 11:24:43 49.925   FALSE
14   1 2018-01-11 11:23:56 49.829   FALSE
15   2 2018-01-11 11:24:00 49.803   FALSE
16   3 2018-01-11 11:24:04 49.793    TRUE
17   4 2018-01-11 11:24:08 49.813    TRUE
18   5 2018-01-11 11:24:11 49.844    TRUE
19   6 2018-01-11 11:24:15 49.830    TRUE
20   7 2018-01-11 11:24:19 49.792    TRUE
21   8 2018-01-11 11:24:23 49.777    TRUE
22   9 2018-01-11 11:24:27 49.810    TRUE
23  10 2018-01-11 11:24:31 49.843    TRUE
24  11 2018-01-11 11:24:35 49.867    TRUE
25  12 2018-01-11 11:24:39 49.913   FALSE
26  13 2018-01-11 11:24:43 49.925   FALSE

with baseR the equivalent syntax is

DF2$counter <- Reduce(\(.x, .y) { if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE }, DF2$value, init = FALSE, accumulate = T)[-1]

Upvotes: 1

Related Questions