Reputation: 9260
I have the following data
test_df <- tibble::tibble(
id = c(rep(1,100), rep(2,100), rep(3,100)),
count = c(rep(0,2), 22,55, rep(0,69), 33,44,55,66,77, rep(0,22),
rep(0,61), rep(250, 39),
rep(0,2), 22,55, 22,55, rep(0,94))
)
id count
<dbl> <dbl>
1 1 0
2 1 0
3 1 22
4 1 55
5 1 0
6 1 0
7 1 0
For each id I have count data (approx. 150M observations). I want to identify intervals
of at least 60 consecutive rows by id of 0 counts,
with allowance for up to 2 consecutive rows of
counts between 0 and 100 per id. The first part I can achieve using cumsum
but I struggle with the allowance part. I constructed the example data in a way that for id = 1 there should be allowed as a valid interval, while in id 3 they should not be allowed. Note that the large number of observation require probably a DT solution.
library(dplyr)
test_df %>%
group_by(id) %>%
mutate(check = cumsum(c(F, abs(diff(count)) > 0))) %>%
group_by(id, check) %>%
mutate(create_identifier = ifelse((count == 0 & n() >= 60), 1,0))
Upvotes: 1
Views: 90
Reputation: 66
Probably not the most efficient solution:
test_df %>%
#create an id for each sequence of count > 0
mutate(event_id = row_number()) %>%
mutate(event_id = case_when(
count > 0 ~ NA_integer_,
TRUE ~ event_id
)) %>%
tidyr::fill(event_id, .direction = "down") %>%
mutate(event_id = case_when(
count == 0 ~ NA_integer_,
TRUE ~ event_id
)) %>%
#check, if count-values should be considered
group_by(event_id) %>%
mutate(allowance_check = case_when(
(n() <= 2) & (id == 1) & (count <= 100) ~ 1, #either sequence of counts is shorter than two observations
count == 0 ~ 1, #or count is 0
TRUE ~ 0
)) %>%
ungroup() %>%
#create ids for grouping
group_by(id) %>%
mutate(allowance_id = row_number()) %>%
mutate(allowance_id = case_when(
allowance_id == n() ~ allowance_id, #for the last row of each group
allowance_check == 1 ~ NA_integer_,
TRUE ~ allowance_id
)) %>%
tidyr::fill(allowance_id, .direction = "up") %>%
mutate(allowance_id = case_when(
allowance_check != 1 ~ NA_integer_,
TRUE ~ allowance_id
)) %>%
ungroup()
#now just filter groups id and allowance_id for n()>= 60
Upvotes: 1