nofunsally
nofunsally

Reputation: 2091

Sum consecutive hours when condition is met

I have a dataframe that has a timestamp and a numeric variable, the data is recorded once every hour. Ultimately, I'd life to know the mean number of hours that the variable is at or below a certain value. For example, what is the average number of hours that var1 is at or below 4? There are missing timestamps in the dataframe, so if the time is not consecutive the sum needs to restart.

In the example data frame the columns HoursBelow5 and RunningGroup were generated 'by hand', if I could create these columns programmatically, I could filter to remove the RunningGroups that were associate with var1 values greater than 4 and then use dplyr::slice to get the maximum HoursBelow5 per group. I could then find the mean of these values.

So, in this approach I would need to create the restarting cumulative sum HoursBelow5, which restarts when the condition var1<5 is not met, or when the timestamp is not consecutive hours. I could then use ifelse statements to create the RunningGroup variable. Is this possible? I may be lacking the jargon to find the procedure. Cumsum and lag seemed promising, but I have yet to construct a procedure that does the above.

Or, there may be a smarter way to do this using the timestamp.

edit: result incorporating code from answer below

df1 <- df %>%
  group_by(group = data.table::rleid(var1 > 4), 
           group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  filter(var1 <= 4) %>% 
  select(ts, var1, temp)

df2 <- df1 %>% mutate(temp2 = ifelse(temp==1, 1, 0), 
                      newgroup = cumsum(temp2))

df3 <- df2 %>% group_by(newgroup) %>% slice(which.max(temp))
mean(df3$temp)

# example dataframe with desired output columns to then get actual output
df <- structure(list(ts = structure(c(-2208967200, -2208963600, -2208960000, 
-2208956400, -2208952800, -2208949200, -2208945600, -2208942000, 
-2208938400, -2208934800, -2208931200, -2208927600, -2208924000, 
-2208913200, -2208909600, -2208906000, -2208902400, -2208898800, 
-2208895200, -2208891600, -2208888000, -2208884400, -2208880800, 
-2208877200, -2208852000, -2208848400, -2208844800, -2208841200, 
-2208837600, -2208834000, -2208830400, -2208826800, -2208823200, 
-2208819600, -2208816000, -2208812400, -2208808800, -2208805200, 
-2208801600), class = c("POSIXct", "POSIXt"), tzone = ""), var1 = c(1L, 
3L, 4L, 5L, 4L, 3L, 5L, 6L, 7L, 8L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 
2L, 1L, 1L, 1L, 1L, 4L, 4L, 3L, 9L, 3L, 3L, 3L, 2L, 2L, 3L, 4L, 
5L, 3L, 2L, 1L, 2L, 3L), HoursBelow5 = c(1L, 2L, 3L, 0L, 1L, 
2L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 1L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 0L, 1L, 2L, 
3L, 4L, 5L), RunningGroup = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 
6L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
10L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 14L, 14L, 14L, 
14L, 14L), NotContinuous = c("", "", "", "", "", "", "", "", 
"", "", "", "", "", "NC", "", "", "", "", "", "", "", "", "", 
"", "NC", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"")), row.names = c(NA, -39L), class = "data.frame")

Upvotes: 0

Views: 264

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389235

One way could using dplyr and data.table::rleid could be

library(dplyr)

df %>%
   group_by(group = data.table::rleid(var1 > 4), 
            group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
   mutate(temp = row_number() * (var1 <= 4)) %>%
   ungroup() %>%
   select(ts, var1, HoursBelow5, temp)

#   ts                   var1 HoursBelow5  temp
#   <dttm>              <int>       <int> <int>
# 1 1900-01-01 12:46:46     1           1     1
# 2 1900-01-01 13:46:46     3           2     2
# 3 1900-01-01 14:46:46     4           3     3
# 4 1900-01-01 15:46:46     5           0     0
# 5 1900-01-01 16:46:46     4           1     1
# 6 1900-01-01 17:46:46     3           2     2
# 7 1900-01-01 18:46:46     5           0     0
# 8 1900-01-01 19:46:46     6           0     0
# 9 1900-01-01 20:46:46     7           0     0
#10 1900-01-01 21:46:46     8           0     0
# … with 29 more rows

temp column is the one which was generated programmatically and HoursBelow5 is kept as it is for comparison purposes. If you also need RunningGroup you could use group and group1 together.

Upvotes: 2

Related Questions