Reputation: 290
In keeping with a previous question, imagine I have a data set:
Date rain code
2009-04-01 0.0 0
2009-04-02 0.0 0
2009-04-03 0.0 0
2009-04-04 0.7 1
2009-04-05 54.2 1
2009-04-06 0.0 0
2009-04-07 5.0 1
2009-04-08 9.0 0
2009-04-09 0.0 0
2009-04-10 0.0 0
2009-04-11 0.0 0
2009-04-12 5.3 1
2009-04-13 10.1 1
2009-04-14 6.0 1
2009-04-15 8.7 1
2009-04-16 0.0 0
2009-04-17 0.0 0
2009-04-18 0.0 0
2009-04-19 2.0 0
2009-04-20 3.0 0
2009-04-21 0.0 0
2009-04-22 0.0 0
2009-04-23 0.0 0
2009-04-24 0.0 0
2009-04-25 4.3 1
2009-04-26 42.2 1
2009-04-27 45.6 1
2009-04-28 12.6 1
2009-04-29 6.2 1
2009-04-30 1.0 1
DT = structure(list(Date = structure(c(14335, 14336, 14337, 14338,
14339, 14340, 14341, 14342, 14343, 14344, 14345, 14346, 14347,
14348, 14349, 14350, 14351, 14352, 14353, 14354, 14355, 14356,
14357, 14358, 14359, 14360, 14361, 14362, 14363, 14364), class = "Date"),
rain = c(0, 0, 0, 0.7, 54.2, 0, 5, 9, 0, 0, 0, 5.3, 10.1,
6, 8.7, 0, 0, 0, 2, 3, 0, 0, 0, 0, 4.3, 42.2, 45.6, 12.6,
6.2, 1), code = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L,
0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L)), .Names = c("Date", "rain", "code"), row.names = c(NA,
-30L), class = "data.frame")
I am trying to collapse the data set to get the sum of consecutive values of rain when code is 1. I need to have sum of them until the day after the event, inclusive. For example, I want to get sum of rain values from 2009-04-13 to 2009-04-06, and 2009-04-07 to 2009-04-08 separately. So I am trying to find way to define when the code is equal to 1 and the following day inclusive. The final product ought to look like:
Date rain code
2009-04-01 0.0 0
2009-04-02 0.0 0
2009-04-03 0.0 0
2009-04-06 54.9 1
2009-04-08 14.0 1
2009-04-09 0.0 0
2009-04-10 0.0 0
2009-04-11 0.0 0
2009-04-16 30.1 1
2009-04-17 0.0 0
2009-04-18 0.0 0
2009-04-19 2.0 0
2009-04-20 3.0 0
2009-04-21 0.0 0
2009-04-22 0.0 0
2009-04-23 0.0 0
2009-04-24 0.0 0
2009-04-30 111.9 1 (if last entry of data frame)
Any help on the above problem would be greatly appreciated.
Upvotes: 2
Views: 276
Reputation: 2797
If you want to use base R you can always use diff
to calculate when the rain starts and stops.
start= which(diff(df$code)==1) +1
end = c(which(diff(df$code)==-1)+1, nrow(df))
l <- mapply(":", start, end)
Getting the data to collapse is just a matter of throwing out all other non stop day indices and replacing the last day with the following, which gets the cumulative rain amount on the day that it stops raining.
lapply(l, function(x) {
df[x,][length(x),"rain"] <- sum(df[x,"rain"])
df[x,][length(x),]
})
[[1]]
Date rain code
6 2009-04-06 54.9 0
[[2]]
Date rain code
8 2009-04-08 14 0
[[3]]
Date rain code
16 2009-04-16 30.1 0
[[4]]
Date rain code
30 2009-04-30 111.9 1
Upvotes: 0
Reputation: 66819
Here's one way:
library(data.table)
setDT(DT)
res = DT[, .(
Date = Date[.N],
rain = sum(rain),
code = code[1L]
), by=.(g = cumsum(shift(!code, fill=FALSE)))]
res[, g := NULL]
Date rain code
1: 2009-04-01 0.0 0
2: 2009-04-02 0.0 0
3: 2009-04-03 0.0 0
4: 2009-04-06 54.9 1
5: 2009-04-08 14.0 1
6: 2009-04-09 0.0 0
7: 2009-04-10 0.0 0
8: 2009-04-11 0.0 0
9: 2009-04-16 30.1 1
10: 2009-04-17 0.0 0
11: 2009-04-18 0.0 0
12: 2009-04-19 2.0 0
13: 2009-04-20 3.0 0
14: 2009-04-21 0.0 0
15: 2009-04-22 0.0 0
16: 2009-04-23 0.0 0
17: 2009-04-24 0.0 0
18: 2009-04-30 111.9 1
How it works:
shift
is taking the value from the prior row!code
is added up, TRUE/FALSE are treated as 1/0.N
is the last row in the by=
groupThe general syntax is DT[, j, by]
where j
is computed using each by
subset of data.
Upvotes: 3