Reputation: 41
I am looking to compute, for multiple sites, a rolling count of the number of times a threshold is exceeded.
A simplified version of my data:
Dates SiteID Value
1 2015-04-01 A 9.1
2 2015-04-02 A 8.8
3 2015-04-02 A 7.9
4 2015-04-03 A 9.2
5 2015-04-08 A 9.3
6 2015-04-11 A 8.9
7 2015-04-11 A 9.2
8 2015-04-13 A 9.1
9 2015-04-16 A 7.8
10 2015-04-01 B 9.1
11 2015-04-01 B 8.8
12 2015-04-04 B 9.9
13 2015-04-05 B 7.8
14 2015-04-06 B 9.8
15 2015-04-06 B 9.2
16 2015-04-07 B 9.1
17 2015-04-08 B 8.5
18 2015-04-15 B 9.1
If the rolling period is 3 days and the threshold for 'Value' is 9, I am looking for a new column, 'Exceedances', that counts the number of times 'Value' was greater than 9 in the last 3 days at a given 'SiteID'. So this would look like:
Dates SiteID Value Exceedances
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
DT = structure(list(r = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), Dates = structure(c(16526, 16527,
16527, 16528, 16533, 16536, 16536, 16538, 16541, 16526, 16526,
16529, 16530, 16531, 16531, 16532, 16533, 16540), class = "Date"),
SiteID = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "B", "B", "B"), Value = c(9.1, 8.8,
7.9, 9.2, 9.3, 8.9, 9.2, 9.1, 7.8, 9.1, 8.8, 9.9, 7.8, 9.8,
9.2, 9.1, 8.5, 9.1), Exceedances = c(1L, 1L, 1L, 2L, 1L,
0L, 1L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L)), .Names = c("r",
"Dates", "SiteID", "Value", "Exceedances"), row.names = c(NA,
-18L), class = "data.frame")
I have seen similar questions that use data.table and dplyr, but none have addressed counting exceedances of a threshold.
Ultimately this will be applied to very large datasets, so the fastest methods are appreciated. In case it affects recommendations: I will also be applying this over a rolling year rather than the 3-day window above, and the dataset will contain NAs.
Upvotes: 3
Views: 824
Reputation: 269624
We can use sqldf to formulate the problem as a complex left join:
library(sqldf)
sqldf("select a.*, sum(b.Value > 9) exceed
       from DT a
       left join DT b on a.SiteID = b.SiteID and     -- same site
                         b.Dates > a.Dates - 3 and   -- within the last 3 days
                         b.rowid <= a.rowid          -- only rows up to the current one
       group by a.rowid")
giving:
Dates SiteID Value exceed
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
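Since the question mentions a rolling year and NAs, here is a sketch of the same query adapted for that case (assuming NAs in Value should not count as exceedances; the CASE expression keeps the count at 0 even when every value in the window is NA):
sqldf("select a.*, sum(case when b.Value > 9 then 1 else 0 end) exceed
       from DT a
       left join DT b on a.SiteID = b.SiteID and
                         b.Dates > a.Dates - 365 and   -- 365-day window instead of 3
                         b.rowid <= a.rowid
       group by a.rowid")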
Upvotes: 1
Reputation: 66819
Since the row number within each site matters, we can add it as a column (the dput above already includes it as r):
library(data.table)
setDT(DT)
DT[, r := rowid(SiteID)]
setcolorder(DT, c("r", setdiff(names(DT), "r")))
Then you can do a non-equi join to count rows meeting the criterion:
DT[, v :=
DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 3, d1 = Dates),
on=.(SiteID, r <= rtop, Dates > d0, Dates <= d1),
sum(Value > 9), by=.EACHI]$V1
]
r Dates SiteID Value Exceedances v
1: 1 2015-04-01 A 9.1 1 1
2: 2 2015-04-02 A 8.8 1 1
3: 3 2015-04-02 A 7.9 1 1
4: 4 2015-04-03 A 9.2 2 2
5: 5 2015-04-08 A 9.3 1 1
6: 6 2015-04-11 A 8.9 0 0
7: 7 2015-04-11 A 9.2 1 1
8: 8 2015-04-13 A 9.1 2 2
9: 9 2015-04-16 A 7.8 0 0
10: 1 2015-04-01 B 9.1 1 1
11: 2 2015-04-01 B 8.8 1 1
12: 3 2015-04-04 B 9.9 1 1
13: 4 2015-04-05 B 7.8 1 1
14: 5 2015-04-06 B 9.8 2 2
15: 6 2015-04-06 B 9.2 3 3
16: 7 2015-04-07 B 9.1 3 3
17: 8 2015-04-08 B 8.5 3 3
18: 9 2015-04-15 B 9.1 1 1
There are some potential problems here:
- If you want to count distinct dates with an exceedance rather than rows, use uniqueN(x.Dates[Value > 9]) instead of sum(Value > 9).
- The columns r and rtop are there so that, when a site has several rows on the same date, rows after the current one are not counted.
Regarding how it works, maybe review the data.table vignettes and my answer to a similar question here.
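For the rolling year and the NAs mentioned in the question, a minimal sketch of the same join, just widening the window and ignoring NAs in Value (the column name v365 is only illustrative, and treating NA as a non-exceedance is an assumption):
DT[, v365 :=
  DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 365, d1 = Dates),   # 365-day window instead of 3
    on = .(SiteID, r <= rtop, Dates > d0, Dates <= d1),
    sum(Value > 9, na.rm = TRUE),                                  # NAs in Value do not count
    by = .EACHI]$V1
]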
Upvotes: 3
Reputation: 13122
Accounting for the fact that the order of the "Dates" column matters, a way seems to be:
thres = 9; n = 3   # threshold and rolling window length in days
do.call(rbind, lapply(split(DT, DT$SiteID),
    function(d) {
        cs = cumsum(d$Value > thres)                                     # running count of exceedances
        i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)   # last row before the window starts
        cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))            # window count as a difference of cumsums
    }))
# r Dates SiteID Value Exceedances exceed
#A.1 1 2015-04-01 A 9.1 1 1
#A.2 2 2015-04-02 A 8.8 1 1
#A.3 3 2015-04-02 A 7.9 1 1
#A.4 4 2015-04-03 A 9.2 2 2
#A.5 5 2015-04-08 A 9.3 1 1
#A.6 6 2015-04-11 A 8.9 0 0
#A.7 7 2015-04-11 A 9.2 1 1
#A.8 8 2015-04-13 A 9.1 2 2
#A.9 9 2015-04-16 A 7.8 0 0
#B.10 1 2015-04-01 B 9.1 1 1
#B.11 2 2015-04-01 B 8.8 1 1
#B.12 3 2015-04-04 B 9.9 1 1
#B.13 4 2015-04-05 B 7.8 1 1
#B.14 5 2015-04-06 B 9.8 2 2
#B.15 6 2015-04-06 B 9.2 3 3
#B.16 7 2015-04-07 B 9.1 3 3
#B.17 8 2015-04-08 B 8.5 3 3
#B.18 9 2015-04-15 B 9.1 1 1
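For reuse on the large datasets and rolling-year window mentioned in the question, the same idea can be wrapped in a small helper (a sketch; roll_exceed is just an illustrative name, and treating NA as a non-exceedance is an assumption):
roll_exceed = function(d, thres = 9, n = 3) {
  cs = cumsum(!is.na(d$Value) & d$Value > thres)                   # NA-safe running exceedance count
  i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)   # last row before the window starts
  cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))
}
do.call(rbind, lapply(split(DT, DT$SiteID), roll_exceed, thres = 9, n = 365))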
Upvotes: 1
Reputation: 1954
Here is an answer using data.table. Simple, probably quick. It uses shift to get the previous two rows' Value, changing NAs to zeros (for the first two rows in each group), gives a 1 for values greater than 9 and 0 otherwise, then adds them (including the 1 or 0 for the current row).
library(data.table)
setDT(DT)
DT[, newCol := ifelse(shift(Value, n=1, fill=0) > 9, 1, 0) +
               ifelse(shift(Value, n=2, fill=0) > 9, 1, 0) +
               ifelse(Value > 9, 1, 0), by = SiteID]
Per Frank's comment:
DT[, newCol := (shift(Value, n=1, fill=0) > 9) + (shift(Value, n=2, fill=0) > 9) + (Value > 9), by = SiteID]
also works.
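If the number of rows in the window needs to change, the same idea can be generalized with a vector of shifts; a sketch (win and newCol2 are only illustrative names, it still counts the previous rows rather than calendar days, and NAs are treated as non-exceedances):
win = 3   # window of rows: the current row and the previous two
DT[, newCol2 := Reduce(`+`, shift(!is.na(Value) & Value > 9, 0:(win - 1), fill = FALSE)), by = SiteID]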
Upvotes: 1