Delete rows based on multiple conditions, but the script does not work

Question

I would like to subset/filter a new dataframe based on multiple conditions. I tried the following code mentioned here (Subset data frame based on multiple conditions) and (Remove group from data.frame if at least one group member meets condition)

A small portion of total database:

df<- structure(list(pat_id = c(10302, 10302, 10302, 
                          10482, 10482,10482,
                          10613, 10613, 10613, 
                          16190, 16190, 16190, 
                          16220, 16220,16220, 16220, 16220, 16220, 16220, 16220), 
               date = c("2014-04-22","2018-12-13", "2020-07-27", "2019-07-15", "2019-09-19", "2019-09-23", 
                         "2015-09-29", "2015-10-06", "2015-11-20", "2013-07-08", "2018-01-30", 
                         "2020-01-09", "2016-06-15", "2018-02-23", "2019-02-14", "2019-08-09", 
                         "2020-03-02", "2020-07-03", "2020-11-09", "2020-12-16"), 
               number = c(1,2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8), 
               col1 = c(0,1, 1, 2, 4, 4, 9, 3, 1, 0, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9), 
               col2 = c(NA_real_,NA_real_, NA_real_, 0, 1, NA_real_, NA_real_, NA_real_, 
                        NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
                        NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), 
               col3 = c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
                        NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
                        NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), 
                class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
               pat_id = c(10302, 10482, 10613, 16190, 16220), .rows = structure(list(
                        1:3, 4:6, 7:9, 10:12, 13:20), ptype = integer(0), class = c("vctrs_list_of", 
                        "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
                        ), row.names = c(NA, -5L), .drop = TRUE))

Each row represents a 'contact', the column 'number' is the contactnumber. So each pat_id has 1 or more contacts. The columns col1-3 represent the treatment(-options). If a pat_id has treatment option 1 in col 1-3 in the first or second contact (number 1 or 2), i want to remove that pat_id value from the dataset.

I tried to create a new dataframe based on the following conditions.

If the number is 1 or 2 AND col1, col2 or col3 is 1, then delete all the rows with the corresponding id value.

Desired output:

id        date    number    col1     col2     col3
10613      ..      1         9        NA       NA
10613      ..      2         3        NA       NA
10613      ..      3         1        NA       NA
etc

I've tried df1 <- df %>% group_by(pat_id) %>% filter(any(!(number <= 2 & (col1 == 1 | col2==1 | col3==1))))

But this does not seem to work. Could it be because of the class/structure of the dataframe? I cant figure it out. If i create a 'dummy' dataframe with similar columns this code does work. But not on the big dataset.

Any tips?

jpsmith · Accepted Answer

The comments were correct that the NAs are throwing you off a bit. If you switch from filter to mutate to see what the unnegated logic is doing, you will see that instead of a boolean vector, it only creates TRUE values, with those not meeting the conditions NA - this is why negating (!) is problematic here:

  group_by(pat_id) %>% 
  mutate(see = any(number %in% 1:2 & across(col1:col3, ~.x == 1)))

#   pat_id date       number  col1  col2  col3 see  
#                
# 1  10302 2014-04-22      1     0    NA    NA TRUE 
# 2  10302 2018-12-13      2     1    NA    NA TRUE 
# 3  10302 2020-07-27      3     1    NA    NA TRUE 
# 4  10482 2019-07-15      1     2     0    NA TRUE 
# 5  10482 2019-09-19      2     4     1    NA TRUE 
# 6  10482 2019-09-23      3     4    NA    NA TRUE 
# 7  10613 2015-09-29      1     9    NA    NA NA   
# 8  10613 2015-10-06      2     3    NA    NA NA   
# 9  10613 2015-11-20      3     1    NA    NA NA

You can quickly account for this using is.na() as a little trick:

df %>% 
  group_by(pat_id) %>% 
  filter(is.na(any(number %in% 1:2 & across(col1:col3, ~.x == 1))))

Output:

#   pat_id date       number  col1  col2  col3
#               
# 1  10613 2015-09-29      1     9    NA    NA
# 2  10613 2015-10-06      2     3    NA    NA
# 3  10613 2015-11-20      3     1    NA    NA
# 4  16220 2016-06-15      1     9    NA    NA
# 5  16220 2018-02-23      2     9    NA    NA
# 6  16220 2019-02-14      3     9    NA    NA
# 7  16220 2019-08-09      4     9    NA    NA
# 8  16220 2020-03-02      5     9    NA    NA
# 9  16220 2020-07-03      6     9    NA    NA
#10  16220 2020-11-09      7     9    NA    NA
#11  16220 2020-12-16      8     9    NA    NA

Delete rows based on multiple conditions, but the script does not work

Answers (2)

Related Questions