Reputation: 169
I would like to subset/filter a new dataframe based on multiple conditions. I tried the following code mentioned here (Subset data frame based on multiple conditions) and (Remove group from data.frame if at least one group member meets condition)
A small portion of total database:
df<- structure(list(pat_id = c(10302, 10302, 10302,
10482, 10482,10482,
10613, 10613, 10613,
16190, 16190, 16190,
16220, 16220,16220, 16220, 16220, 16220, 16220, 16220),
date = c("2014-04-22","2018-12-13", "2020-07-27", "2019-07-15", "2019-09-19", "2019-09-23",
"2015-09-29", "2015-10-06", "2015-11-20", "2013-07-08", "2018-01-30",
"2020-01-09", "2016-06-15", "2018-02-23", "2019-02-14", "2019-08-09",
"2020-03-02", "2020-07-03", "2020-11-09", "2020-12-16"),
number = c(1,2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8),
col1 = c(0,1, 1, 2, 4, 4, 9, 3, 1, 0, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9),
col2 = c(NA_real_,NA_real_, NA_real_, 0, 1, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
col3 = c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
pat_id = c(10302, 10482, 10613, 16190, 16220), .rows = structure(list(
1:3, 4:6, 7:9, 10:12, 13:20), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
Each row represents a 'contact', the column 'number' is the contactnumber. So each pat_id has 1 or more contacts. The columns col1-3 represent the treatment(-options). If a pat_id has treatment option 1 in col 1-3 in the first or second contact (number 1 or 2), i want to remove that pat_id value from the dataset.
I tried to create a new dataframe based on the following conditions.
If the number is 1 or 2 AND col1, col2 or col3 is 1, then delete all the rows with the corresponding id value.
Desired output:
id date number col1 col2 col3
10613 .. 1 9 NA NA
10613 .. 2 3 NA NA
10613 .. 3 1 NA NA
etc
I've tried df1 <- df %>% group_by(pat_id) %>% filter(any(!(number <= 2 & (col1 == 1 | col2==1 | col3==1))))
But this does not seem to work. Could it be because of the class/structure of the dataframe? I cant figure it out. If i create a 'dummy' dataframe with similar columns this code does work. But not on the big dataset.
Any tips?
Upvotes: 1
Views: 68
Reputation: 3081
library(dplyr)
library(magrittr)
df %>% ungroup %>%
dplyr::filter(
!any(number %in% c(1,2) & (col1 %in% 1 | col2 %in% 1 | col3 %in% 1)),
.by = pat_id)
gives:
# A tibble: 11 × 6
pat_id date number col1 col2 col3
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 10613 2015-09-29 1 9 NA NA
2 10613 2015-10-06 2 3 NA NA
3 10613 2015-11-20 3 1 NA NA
4 16220 2016-06-15 1 9 NA NA
5 16220 2018-02-23 2 9 NA NA
6 16220 2019-02-14 3 9 NA NA
7 16220 2019-08-09 4 9 NA NA
8 16220 2020-03-02 5 9 NA NA
9 16220 2020-07-03 6 9 NA NA
10 16220 2020-11-09 7 9 NA NA
11 16220 2020-12-16 8 9 NA NA
Explanation:
filter()
retains rows matching conditions.
.by = pat_id
allows us to design a filter that applies to a whole group at a time
!any()
reads as "so long as no member of the group meets this criteria"
number %in% c(1,2) & (col1 %in% 1 | col2 %in% 1 | col3 %in% 1)
is the criteria
note that the 3 colx checks are done with %in%
rather than ==
as ==
doesn't play nicely with NA
values.
Upvotes: 1
Reputation: 17450
The comments were correct that the NA
s are throwing you off a bit. If you switch from filter
to mutate
to see
what the unnegated logic is doing, you will see that instead of a boolean vector, it only creates TRUE
values, with those not meeting the conditions NA
- this is why negating (!
) is problematic here:
group_by(pat_id) %>%
mutate(see = any(number %in% 1:2 & across(col1:col3, ~.x == 1)))
# pat_id date number col1 col2 col3 see
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
# 1 10302 2014-04-22 1 0 NA NA TRUE
# 2 10302 2018-12-13 2 1 NA NA TRUE
# 3 10302 2020-07-27 3 1 NA NA TRUE
# 4 10482 2019-07-15 1 2 0 NA TRUE
# 5 10482 2019-09-19 2 4 1 NA TRUE
# 6 10482 2019-09-23 3 4 NA NA TRUE
# 7 10613 2015-09-29 1 9 NA NA NA
# 8 10613 2015-10-06 2 3 NA NA NA
# 9 10613 2015-11-20 3 1 NA NA NA
You can quickly account for this using is.na()
as a little trick:
df %>%
group_by(pat_id) %>%
filter(is.na(any(number %in% 1:2 & across(col1:col3, ~.x == 1))))
Output:
# pat_id date number col1 col2 col3
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 10613 2015-09-29 1 9 NA NA
# 2 10613 2015-10-06 2 3 NA NA
# 3 10613 2015-11-20 3 1 NA NA
# 4 16220 2016-06-15 1 9 NA NA
# 5 16220 2018-02-23 2 9 NA NA
# 6 16220 2019-02-14 3 9 NA NA
# 7 16220 2019-08-09 4 9 NA NA
# 8 16220 2020-03-02 5 9 NA NA
# 9 16220 2020-07-03 6 9 NA NA
#10 16220 2020-11-09 7 9 NA NA
#11 16220 2020-12-16 8 9 NA NA
Upvotes: 5