Reputation: 41
I am looking to compute, for multiple sites, a rolling count of the number of times a threshold is exceeded.
A simplified version of my data:
Dates SiteID Value
1 2015-04-01 A 9.1
2 2015-04-02 A 8.8
3 2015-04-02 A 7.9
4 2015-04-03 A 9.2
5 2015-04-08 A 9.3
6 2015-04-11 A 8.9
7 2015-04-11 A 9.2
8 2015-04-13 A 9.1
9 2015-04-16 A 7.8
10 2015-04-01 B 9.1
11 2015-04-01 B 8.8
12 2015-04-04 B 9.9
13 2015-04-05 B 7.8
14 2015-04-06 B 9.8
15 2015-04-06 B 9.2
16 2015-04-07 B 9.1
17 2015-04-08 B 8.5
18 2015-04-15 B 9.1
If the rolling period is 3 days and the threshold for 'Value' is 9, I am looking for a new column, 'Exceedances', that counts the number of times 'Value' was greater than 9 in the last 3 days at a given 'SiteID'. So this would look like:
Dates SiteID Value Exceedances
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
DT = structure(list(r = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), Dates = structure(c(16526, 16527,
16527, 16528, 16533, 16536, 16536, 16538, 16541, 16526, 16526,
16529, 16530, 16531, 16531, 16532, 16533, 16540), class = "Date"),
SiteID = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "B", "B", "B"), Value = c(9.1, 8.8,
7.9, 9.2, 9.3, 8.9, 9.2, 9.1, 7.8, 9.1, 8.8, 9.9, 7.8, 9.8,
9.2, 9.1, 8.5, 9.1), Exceedances = c(1L, 1L, 1L, 2L, 1L,
0L, 1L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L)), .Names = c("r",
"Dates", "SiteID", "Value", "Exceedances"), row.names = c(NA,
-18L), class = "data.frame")
I have seen similar questions that use data.table and dplyr, but none have addressed counting exceedances of a threshold.
Ultimately this will be applied to very large datasets, so the fastest methods are appreciated. In case it affects recommendations: I will also be applying this over a rolling year rather than the 3-day window above, and the dataset will contain NAs.
Upvotes: 3
Views: 824
Reputation: 269624
We can use sqldf to formulate the problem as a complex left join:
library(sqldf)
sqldf("select a.*, sum(b.Value > 9) exceed
       from DT a
       left join DT b on a.SiteID = b.SiteID and     -- same site
                         b.Dates > a.Dates - 3 and   -- within the last 3 days
                         b.rowid <= a.rowid          -- only rows up to the current one
       group by a.rowid")
giving:
Dates SiteID Value exceed
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
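Since the question mentions a rolling year and NAs, here is a sketch of the same query adapted for that case (assuming NAs in Value should not count as exceedances; the CASE expression keeps the count at 0 even when every value in the window is NA):
sqldf("select a.*, sum(case when b.Value > 9 then 1 else 0 end) exceed
       from DT a
       left join DT b on a.SiteID = b.SiteID and
                         b.Dates > a.Dates - 365 and   -- 365-day window instead of 3
                         b.rowid <= a.rowid
       group by a.rowid")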
Upvotes: 1
Reputation: 66819
Since the row number within each site matters, we can add it as a column (the dput above already includes it as r):
library(data.table)
setDT(DT)
DT[, r := rowid(SiteID)]
setcolorder(DT, c("r", setdiff(names(DT), "r")))
Then you can do a non-equi join to count rows meeting the criterion:
DT[, v :=
DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 3, d1 = Dates),
on=.(SiteID, r <= rtop, Dates > d0, Dates <= d1),
sum(Value > 9), by=.EACHI]$V1
]
r Dates SiteID Value Exceedances v
1: 1 2015-04-01 A 9.1 1 1
2: 2 2015-04-02 A 8.8 1 1
3: 3 2015-04-02 A 7.9 1 1
4: 4 2015-04-03 A 9.2 2 2
5: 5 2015-04-08 A 9.3 1 1
6: 6 2015-04-11 A 8.9 0 0
7: 7 2015-04-11 A 9.2 1 1
8: 8 2015-04-13 A 9.1 2 2
9: 9 2015-04-16 A 7.8 0 0
10: 1 2015-04-01 B 9.1 1 1
11: 2 2015-04-01 B 8.8 1 1
12: 3 2015-04-04 B 9.9 1 1
13: 4 2015-04-05 B 7.8 1 1
14: 5 2015-04-06 B 9.8 2 2
15: 6 2015-04-06 B 9.2 3 3
16: 7 2015-04-07 B 9.1 3 3
17: 8 2015-04-08 B 8.5 3 3
18: 9 2015-04-15 B 9.1 1 1
There are some potential problems here:
- If you want to count distinct dates with an exceedance rather than rows, use uniqueN(x.Dates[Value > 9]) instead of sum(Value > 9).
- The columns r and rtop are there so that, when a site has several rows on the same date, rows after the current one are not counted.
Regarding how it works, maybe review the data.table vignettes and my answer to a similar question here.
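For the rolling year and the NAs mentioned in the question, a minimal sketch of the same join, just widening the window and ignoring NAs in Value (the column name v365 is only illustrative, and treating NA as a non-exceedance is an assumption):
DT[, v365 :=
  DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 365, d1 = Dates),   # 365-day window instead of 3
    on = .(SiteID, r <= rtop, Dates > d0, Dates <= d1),
    sum(Value > 9, na.rm = TRUE),                                  # NAs in Value do not count
    by = .EACHI]$V1
]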
Upvotes: 3
Reputation: 13122
Accounting for the fact that the order of the "Dates" column matters, a way seems to be:
thres = 9; n = 3   # threshold and rolling window length in days
do.call(rbind, lapply(split(DT, DT$SiteID),
    function(d) {
        cs = cumsum(d$Value > thres)                                     # running count of exceedances
        i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)   # last row before the window starts
        cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))            # window count as a difference of cumsums
    }))
# r Dates SiteID Value Exceedances exceed
#A.1 1 2015-04-01 A 9.1 1 1
#A.2 2 2015-04-02 A 8.8 1 1
#A.3 3 2015-04-02 A 7.9 1 1
#A.4 4 2015-04-03 A 9.2 2 2
#A.5 5 2015-04-08 A 9.3 1 1
#A.6 6 2015-04-11 A 8.9 0 0
#A.7 7 2015-04-11 A 9.2 1 1
#A.8 8 2015-04-13 A 9.1 2 2
#A.9 9 2015-04-16 A 7.8 0 0
#B.10 1 2015-04-01 B 9.1 1 1
#B.11 2 2015-04-01 B 8.8 1 1
#B.12 3 2015-04-04 B 9.9 1 1
#B.13 4 2015-04-05 B 7.8 1 1
#B.14 5 2015-04-06 B 9.8 2 2
#B.15 6 2015-04-06 B 9.2 3 3
#B.16 7 2015-04-07 B 9.1 3 3
#B.17 8 2015-04-08 B 8.5 3 3
#B.18 9 2015-04-15 B 9.1 1 1
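For reuse on the large datasets and rolling-year window mentioned in the question, the same idea can be wrapped in a small helper (a sketch; roll_exceed is just an illustrative name, and treating NA as a non-exceedance is an assumption):
roll_exceed = function(d, thres = 9, n = 3) {
  cs = cumsum(!is.na(d$Value) & d$Value > thres)                   # NA-safe running exceedance count
  i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)   # last row before the window starts
  cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))
}
do.call(rbind, lapply(split(DT, DT$SiteID), roll_exceed, thres = 9, n = 365))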
Upvotes: 1
Reputation: 1954
Here is an answer using data.table. Simple, probably quick. It uses shift to get the previous two rows' Value, changing NAs to zeros (for the first two rows in each group), gives a 1 for values greater than 9 and 0 otherwise, then adds them (including the 1 or 0 for the current row).
library(data.table)
setDT(DT)
DT[, newCol := ifelse(shift(Value, n=1, fill=0) > 9, 1, 0) +
               ifelse(shift(Value, n=2, fill=0) > 9, 1, 0) +
               ifelse(Value > 9, 1, 0), by = SiteID]
Per Frank's comment:
DT[, newCol := (shift(Value, n=1, fill=0) > 9) + (shift(Value, n=2, fill=0) > 9) + (Value > 9), by = SiteID]
also works.
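If the number of rows in the window needs to change, the same idea can be generalized with a vector of shifts; a sketch (win and newCol2 are only illustrative names, it still counts the previous rows rather than calendar days, and NAs are treated as non-exceedances):
win = 3   # window of rows: the current row and the previous two
DT[, newCol2 := Reduce(`+`, shift(!is.na(Value) & Value > 9, 0:(win - 1), fill = FALSE)), by = SiteID]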
Upvotes: 1