Reputation: 2091
I have a dataframe that has a timestamp and a numeric variable; the data is recorded once every hour. Ultimately, I'd like to know the mean number of hours that the variable is at or below a certain value. For example, what is the average number of hours that var1 is at or below 4? There are missing timestamps in the dataframe, so if the time is not consecutive the sum needs to restart.
In the example data frame the columns HoursBelow5 and RunningGroup were generated 'by hand'. If I could create these columns programmatically, I could filter out the RunningGroups associated with var1 values greater than 4 and then use dplyr::slice to get the maximum HoursBelow5 per group. I could then find the mean of these values.
So, in this approach I would need to create the restarting cumulative sum HoursBelow5, which restarts when the condition var1 < 5 is not met or when the timestamp is not consecutive hours. I could then use ifelse statements to create the RunningGroup variable. Is this possible? I may be lacking the jargon to find the procedure. cumsum and lag seemed promising, but I have yet to construct a procedure that does the above.
Or, there may be a smarter way to do this using the timestamp.
Edit: result incorporating code from the answer below
df1 <- df %>%
  # compare timestamp gaps in explicit seconds: a bare difftime here can
  # carry units of hours, which would make the > 3600 test always FALSE
  group_by(group = data.table::rleid(var1 > 4),
           group1 = cumsum(as.numeric(ts - lag(ts, default = first(ts)), units = "secs") > 3600)) %>%
mutate(temp = row_number() * (var1 <= 4)) %>%
ungroup() %>%
filter(var1 <= 4) %>%
select(ts, var1, temp)
df2 <- df1 %>% mutate(temp2 = ifelse(temp==1, 1, 0),
newgroup = cumsum(temp2))
df3 <- df2 %>% group_by(newgroup) %>% slice(which.max(temp))
mean(df3$temp)
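For reference, the three intermediate data frames above can be collapsed into a single pipe: after filtering, temp == 1 marks the start of each qualifying run, so a cumulative sum over that flag rebuilds the group ids directly. A minimal self-contained sketch on hypothetical toy data (hourly stamps with one gap):

```r
library(dplyr)

# toy data: hourly stamps, hours 5-6 missing, one value above the threshold
df_toy <- data.frame(
  ts   = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * c(0, 1, 2, 3, 4, 7, 8),
  var1 = c(3, 2, 5, 4, 1, 2, 2)
)

res <- df_toy %>%
  group_by(group  = data.table::rleid(var1 > 4),
           group1 = cumsum(as.numeric(ts - lag(ts, default = first(ts)), units = "secs") > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  filter(var1 <= 4) %>%
  group_by(newgroup = cumsum(temp == 1)) %>%  # temp == 1 starts a new run
  summarise(run_length = max(temp)) %>%
  summarise(mean_hours = mean(run_length))

res$mean_hours  # three runs of length 2 -> 2
```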
# example dataframe with desired output columns to then get actual output
df <- structure(list(ts = structure(c(-2208967200, -2208963600, -2208960000,
-2208956400, -2208952800, -2208949200, -2208945600, -2208942000,
-2208938400, -2208934800, -2208931200, -2208927600, -2208924000,
-2208913200, -2208909600, -2208906000, -2208902400, -2208898800,
-2208895200, -2208891600, -2208888000, -2208884400, -2208880800,
-2208877200, -2208852000, -2208848400, -2208844800, -2208841200,
-2208837600, -2208834000, -2208830400, -2208826800, -2208823200,
-2208819600, -2208816000, -2208812400, -2208808800, -2208805200,
-2208801600), class = c("POSIXct", "POSIXt"), tzone = ""), var1 = c(1L,
3L, 4L, 5L, 4L, 3L, 5L, 6L, 7L, 8L, 3L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 1L, 1L, 1L, 1L, 4L, 4L, 3L, 9L, 3L, 3L, 3L, 2L, 2L, 3L, 4L,
5L, 3L, 2L, 1L, 2L, 3L), HoursBelow5 = c(1L, 2L, 3L, 0L, 1L,
2L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 1L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 0L, 1L, 2L,
3L, 4L, 5L), RunningGroup = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L,
6L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
10L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 14L, 14L, 14L,
14L, 14L), NotContinuous = c("", "", "", "", "", "", "", "",
"", "", "", "", "", "NC", "", "", "", "", "", "", "", "", "",
"", "NC", "", "", "", "", "", "", "", "", "", "", "", "", "",
"")), row.names = c(NA, -39L), class = "data.frame")
Upvotes: 0
Views: 264
Reputation: 389235
One way, using dplyr and data.table::rleid, could be
library(dplyr)
df %>%
  # gap test in explicit seconds (a bare difftime can default to hour units)
  group_by(group = data.table::rleid(var1 > 4),
           group1 = cumsum(as.numeric(ts - lag(ts, default = first(ts)), units = "secs") > 3600)) %>%
mutate(temp = row_number() * (var1 <= 4)) %>%
ungroup() %>%
select(ts, var1, HoursBelow5, temp)
# ts var1 HoursBelow5 temp
# <dttm> <int> <int> <int>
# 1 1900-01-01 12:46:46 1 1 1
# 2 1900-01-01 13:46:46 3 2 2
# 3 1900-01-01 14:46:46 4 3 3
# 4 1900-01-01 15:46:46 5 0 0
# 5 1900-01-01 16:46:46 4 1 1
# 6 1900-01-01 17:46:46 3 2 2
# 7 1900-01-01 18:46:46 5 0 0
# 8 1900-01-01 19:46:46 6 0 0
# 9 1900-01-01 20:46:46 7 0 0
#10 1900-01-01 21:46:46 8 0 0
# … with 29 more rows
The temp column is the one generated programmatically; HoursBelow5 is kept as it is for comparison purposes. If you also need RunningGroup, you could use group and group1 together.
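A sketch of that combination on hypothetical toy data: data.table::rleid over both keys yields a new id whenever either one changes. Note the ids won't match the hand-made RunningGroup numbering exactly, because consecutive above-threshold hours collapse into a single run id, but those rows get discarded before computing the mean anyway.

```r
library(dplyr)

# toy data: hourly stamps, hours 5-6 missing, one value above the threshold
df_toy <- data.frame(
  ts   = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * c(0, 1, 2, 3, 4, 7, 8),
  var1 = c(3, 2, 5, 4, 1, 2, 2)
)

out <- df_toy %>%
  group_by(group  = data.table::rleid(var1 > 4),
           group1 = cumsum(as.numeric(ts - lag(ts, default = first(ts)), units = "secs") > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  # one id per contiguous run: increments when either grouping key changes
  mutate(RunningGroup = data.table::rleid(group, group1)) %>%
  select(ts, var1, temp, RunningGroup)

out$RunningGroup  # 1 1 2 3 3 4 4 (the gap before hour 7 splits the last run)
```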
Upvotes: 2