Reputation: 71
I have some timeseries data where I'm interested in capturing the consecutive periods that the data drops below a value x
until the value goes above y
(where y
> x
). The challenge is that the values might go above and below x several times during this period.
There's likely to be several of these consecutive, non-overlapping periods in a given dataset.
A basic example:
row timestamp value
1 2018-01-11 11:23:56 49.829
2 2018-01-11 11:24:00 49.803
3 2018-01-11 11:24:04 49.793
4 2018-01-11 11:24:08 49.813
5 2018-01-11 11:24:11 49.844
6 2018-01-11 11:24:15 49.830
7 2018-01-11 11:24:19 49.792
8 2018-01-11 11:24:23 49.777
9 2018-01-11 11:24:27 49.810
10 2018-01-11 11:24:31 49.843
11 2018-01-11 11:24:35 49.867
12 2018-01-11 11:24:39 49.913
13 2018-01-11 11:24:43 49.925
So in the above example my result would be rows 3-12. I'd like to exclude overlapping period of rows 7-12 for example.
I've played around a lot and struggling to get anything to work. The most logical approach seems to be to establish a counter that initiates when the values drop below 49.8 and doesn't stop until the value is above 49.9. But I'm not sure how to implement this.
Any help greatly appreciated!
Upvotes: 2
Views: 144
Reputation: 270428
We can encode the implied finite state machine into a regular expression. This uses no packages, has straight forward logic and runs quickly (see benchmark).
Using the inputs defined reproducibly in the Note at the end (we have expanded the question's input to have two stretches) create an indicator, ind, that is
Then convert that to a character string and use gregexpr to find the stretches from 0 consisting of 0's and 1's.
ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
if (min(g) == -1) g <- c()
res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length"))
giving:
res
## start end
## 1 3 12
## 2 16 25
The question did not specify the form of output so if you wanted a 0/1 vector instead then this will convert the above output to such a vector:
with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
## [1] 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0
If the data is large and performance is of concern then these run very fast.
library(microbenchmark)
library(purrr)
library(dplyr)
microbenchmark(
regex = {
ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length"))
},
regex2 = {
ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length"))
with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
},
accum = {
DF2 %>%
mutate(counter = accumulate(value,.init = FALSE,
~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## regex 1.0019 1.10215 1.209319 1.19970 1.24970 2.3949 100 a
## regex2 1.3651 1.46005 1.599009 1.54315 1.65880 2.8078 100 b
## accum 8.8840 9.95140 10.492953 10.34490 10.86335 13.5756 100 c
We have expanded the input so that there are two stretches to be matched.
DF <- structure(list(row = 1:13, timestamp = c("2018-01-11 11:23:56",
"2018-01-11 11:24:00", "2018-01-11 11:24:04", "2018-01-11 11:24:08",
"2018-01-11 11:24:11", "2018-01-11 11:24:15", "2018-01-11 11:24:19",
"2018-01-11 11:24:23", "2018-01-11 11:24:27", "2018-01-11 11:24:31",
"2018-01-11 11:24:35", "2018-01-11 11:24:39", "2018-01-11 11:24:43"
), value = c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792,
49.777, 49.81, 49.843, 49.867, 49.913, 49.925)),
class = "data.frame", row.names = c(NA, -13L))
DF2 <- rbind(DF, DF)
x <- 49.8
y <- 49.9
DF2 just defined is just two copies of DF. Note that we don't use row and timestamp above so we focus on value.
> DF2
row timestamp value
1 1 2018-01-11 11:23:56 49.829
2 2 2018-01-11 11:24:00 49.803
3 3 2018-01-11 11:24:04 49.793
4 4 2018-01-11 11:24:08 49.813
5 5 2018-01-11 11:24:11 49.844
6 6 2018-01-11 11:24:15 49.830
7 7 2018-01-11 11:24:19 49.792
8 8 2018-01-11 11:24:23 49.777
9 9 2018-01-11 11:24:27 49.810
10 10 2018-01-11 11:24:31 49.843
11 11 2018-01-11 11:24:35 49.867
12 12 2018-01-11 11:24:39 49.913
13 13 2018-01-11 11:24:43 49.925
14 1 2018-01-11 11:23:56 49.829
15 2 2018-01-11 11:24:00 49.803
16 3 2018-01-11 11:24:04 49.793
17 4 2018-01-11 11:24:08 49.813
18 5 2018-01-11 11:24:11 49.844
19 6 2018-01-11 11:24:15 49.830
20 7 2018-01-11 11:24:19 49.792
21 8 2018-01-11 11:24:23 49.777
22 9 2018-01-11 11:24:27 49.810
23 10 2018-01-11 11:24:31 49.843
24 11 2018-01-11 11:24:35 49.867
25 12 2018-01-11 11:24:39 49.913
26 13 2018-01-11 11:24:43 49.925
Upvotes: 3
Reputation: 26238
You may also use purrr::accumulate
Demo on the sample data included by @G.Grothendieck,
DF2 %>%
mutate(counter = accumulate(value,.init = FALSE,
~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])
row timestamp value counter
1 1 2018-01-11 11:23:56 49.829 FALSE
2 2 2018-01-11 11:24:00 49.803 FALSE
3 3 2018-01-11 11:24:04 49.793 TRUE
4 4 2018-01-11 11:24:08 49.813 TRUE
5 5 2018-01-11 11:24:11 49.844 TRUE
6 6 2018-01-11 11:24:15 49.830 TRUE
7 7 2018-01-11 11:24:19 49.792 TRUE
8 8 2018-01-11 11:24:23 49.777 TRUE
9 9 2018-01-11 11:24:27 49.810 TRUE
10 10 2018-01-11 11:24:31 49.843 TRUE
11 11 2018-01-11 11:24:35 49.867 TRUE
12 12 2018-01-11 11:24:39 49.913 FALSE
13 13 2018-01-11 11:24:43 49.925 FALSE
14 1 2018-01-11 11:23:56 49.829 FALSE
15 2 2018-01-11 11:24:00 49.803 FALSE
16 3 2018-01-11 11:24:04 49.793 TRUE
17 4 2018-01-11 11:24:08 49.813 TRUE
18 5 2018-01-11 11:24:11 49.844 TRUE
19 6 2018-01-11 11:24:15 49.830 TRUE
20 7 2018-01-11 11:24:19 49.792 TRUE
21 8 2018-01-11 11:24:23 49.777 TRUE
22 9 2018-01-11 11:24:27 49.810 TRUE
23 10 2018-01-11 11:24:31 49.843 TRUE
24 11 2018-01-11 11:24:35 49.867 TRUE
25 12 2018-01-11 11:24:39 49.913 FALSE
26 13 2018-01-11 11:24:43 49.925 FALSE
with baseR the equivalent syntax is
DF2$counter <- Reduce(\(.x, .y) { if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE }, DF2$value, init = FALSE, accumulate = T)[-1]
Upvotes: 1