Julian
Julian

Reputation: 9260

Filter with exceptions/allowance

I have the following data

test_df <- tibble::tibble(
  id = c(rep(1,100), rep(2,100), rep(3,100)),
  count = c(rep(0,2), 22,55, rep(0,69), 33,44,55,66,77, rep(0,22),
                rep(0,61), rep(250, 39), 
                rep(0,2), 22,55, 22,55, rep(0,94))
)



    id count
   <dbl> <dbl>
 1     1     0
 2     1     0
 3     1    22
 4     1    55
 5     1     0
 6     1     0
 7     1     0

For each id I have count data (approx. 150M observations). I want to identify intervals of at least 60 consecutive rows by id of 0 counts, with allowance for up to 2 consecutive rows of counts between 0 and 100 per id. The first part I can achieve using cumsum but I struggle with the allowance part. I constructed the example data in a way that for id = 1 there should be allowed as a valid interval, while in id 3 they should not be allowed. Note that the large number of observation require probably a DT solution.

library(dplyr)
test_df %>% 
  group_by(id) %>% 
  mutate(check = cumsum(c(F, abs(diff(count)) > 0))) %>% 
  group_by(id, check) %>% 
  mutate(create_identifier = ifelse((count == 0 & n() >= 60), 1,0)) 

Upvotes: 1

Views: 90

Answers (1)

opd
opd

Reputation: 66

Probably not the most efficient solution:

test_df %>% 
  #create an id for each sequence of count > 0
  mutate(event_id = row_number()) %>% 
  mutate(event_id = case_when(
    count > 0 ~ NA_integer_,
    TRUE ~ event_id
  )) %>% 
  tidyr::fill(event_id, .direction = "down") %>% 
  mutate(event_id = case_when(
    count == 0 ~ NA_integer_,
    TRUE ~ event_id
  )) %>% 
  #check, if count-values should be considered
  group_by(event_id) %>% 
  mutate(allowance_check = case_when(
    (n() <= 2) & (id == 1) & (count <= 100) ~ 1, #either sequence of counts is shorter than two observations
    count == 0 ~ 1, #or count is 0
    TRUE ~ 0
  )) %>% 
  ungroup() %>% 
  #create ids for grouping
  group_by(id) %>% 
  mutate(allowance_id = row_number()) %>% 
  mutate(allowance_id = case_when(
    allowance_id == n() ~ allowance_id, #for the last row of each group
    allowance_check == 1 ~ NA_integer_,
    TRUE ~ allowance_id
  )) %>% 
  tidyr::fill(allowance_id, .direction = "up") %>% 
  mutate(allowance_id = case_when(
    allowance_check != 1 ~ NA_integer_,
    TRUE ~ allowance_id
  )) %>% 
  ungroup()
#now just filter groups id and allowance_id for n()>= 60

Upvotes: 1

Related Questions