cgvoller
cgvoller

Reputation: 879

Filtering rows with group by and matching different strings in multiple columns

I have a data frame with various variables. My aim is to essentially filter out rows where match1 and match2 don't meet my criteria but grouping for other variables. For example, in the data frame below I want to group by ID,name, all vars, time and time_unit. I've created a column n to show more clearly the difference between the expected output and the data.

When grouping by these you can see that there are incorrect duplicates in the columns match1 and match2, i.e. at row n = 7,10,11,13 because the columns match1 and match2 are not the same but the other columns are. I also want to include rows even if one of the match columns contains NA. Note: row 9 I want group1 to also match group3.

I've tried different approaches and couldn't find anything similar online. How do I match but also create a sort of dictionary/list on what is allowed to be matched? I.e. I would want to keep rows if match1 contains group1 and match2 contains group2 but also group3 and vice versa. My real data frame contains many more rows and I have changed the data.

The dataframe:

    ID  name var1     var2 var3 var4 var5 match1 time time_unit match2  n
 1:  1 name1 trt1 Flexible 10.0   40   mg Group1    6     Weeks   <NA>  1
 2:  1 name1 trt2 Flexible   NA   NA <NA> Group1    6     Weeks Group1  2
 3:  1 name1 trt3 Flexible 12.5   50   mg Group1    8     Weeks   <NA>  3
 4:  1 name1 trt1 Flexible 10.0   40   mg Group1    8     Weeks   <NA>  4
 5:  1 name1 trt2 Flexible   NA   NA <NA> Group1    8     Weeks Group1  5
 6:  2 name2 trt4    Fixed 10.0   10   mg Group1    0     weeks Group1  6
 7:  2 name2 trt4    Fixed 10.0   10   mg Group1    0     weeks Group2  7
 8:  2 name2 trt5    Fixed 20.0   20   mg Group1    0     weeks Group1  8
 9:  2 name2 trt5    Fixed 20.0   20   mg Group1    0     weeks Group3  9
10:  2 name2 trt5    Fixed 20.0   20   mg Group1    0     weeks Group2 10
11:  2 name2 trt4    Fixed 10.0   10   mg Group2    0     weeks Group1 11
12:  2 name2 trt4    Fixed 10.0   10   mg Group2    0     weeks Group2 12
13:  2 name2 trt5    Fixed 20.0   20   mg Group2    0     weeks Group1 13
14:  2 name2 trt5    Fixed 20.0   20   mg Group2    0     weeks Group2 14
15:  3 name3 trt6 Flexible 10.0   40   mg Group1    0     weeks Group1 15
16:  3 name3 trt2 Flexible   NA   NA <NA> Group1    0     weeks Group1 16
17:  3 name3 trt6 Flexible 10.0   40   mg Group1    8     Weeks Group1 17
18:  3 name3 trt2 Flexible   NA   NA <NA> Group1    8     weeks Group1 18

Expected output:

    ID  name var1     var2 var3 var4 var5 match1 time time_unit match2  n
 1:  1 name1 trt1 Flexible 10.0   40   mg Group1    6     Weeks   <NA>  1
 2:  1 name1 trt2 Flexible   NA   NA <NA> Group1    6     Weeks Group1  2
 3:  1 name1 trt3 Flexible 12.5   50   mg Group1    8     Weeks   <NA>  3
 4:  1 name1 trt1 Flexible 10.0   40   mg Group1    8     Weeks   <NA>  4
 5:  1 name1 trt2 Flexible   NA   NA <NA> Group1    8     Weeks Group1  5
 6:  2 name2 trt4    Fixed 10.0   10   mg Group1    0     weeks Group1  6
 7:  2 name2 trt5    Fixed 20.0   20   mg Group1    0     weeks Group1  8
 8:  2 name2 trt5    Fixed 20.0   20   mg Group1    0     weeks Group3  9
 9:  2 name2 trt4    Fixed 10.0   10   mg Group2    0     weeks Group2 12
10:  2 name2 trt5    Fixed 20.0   20   mg Group2    0     weeks Group2 14
11:  3 name3 trt6 Flexible 10.0   40   mg Group1    0     weeks Group1 15
12:  3 name3 trt2 Flexible   NA   NA <NA> Group1    0     weeks Group1 16
13:  3 name3 trt6 Flexible 10.0   40   mg Group1    8     weeks Group1 17
14:  3 name3 trt2 Flexible   NA   NA <NA> Group1    8     weeks Group1 18

This is the dput data:

structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), name = c("name1", "name1", 
"name1", "name1", "name1", "name2", "name2", "name2", "name2", 
"name2", "name2", "name2", "name2", "name2", "name3", "name3", 
"name3", "name3", "name3"), var1 = c("trt1", "trt2", "trt3", 
"trt1", "trt2", "trt4", "trt4", "trt5", "trt5", "trt5", "trt4", 
"trt4", "trt5", "trt5", "trt6", "trt2", "trt6", "trt2", "trt6"
), var2 = c("Flexible", "Flexible", "Flexible", "Flexible", "Flexible", 
"Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", 
"Fixed", "Fixed", "Flexible", "Flexible", "Flexible", "Flexible", 
"Flexible"), var3 = c(10, NA, 12.5, 10, NA, 10, 10, 20, 20, 20, 
10, 10, 20, 20, 10, NA, 10, NA, 10), var4 = c(40L, NA, 50L, 40L, 
NA, 10L, 10L, 20L, 20L, 20L, 10L, 10L, 20L, 20L, 40L, NA, 40L, 
NA, 40L), var5 = c("mg", NA, "mg", "mg", NA, "mg", "mg", "mg", 
"mg", "mg", "mg", "mg", "mg", "mg", "mg", NA, "mg", NA, "mg"), 
    match1 = c("Group1", "Group1", "Group1", "Group1", "Group1", 
    "Group1", "Group1", "Group1", "Group1", "Group1", "Group2", 
    "Group2", "Group2", "Group2", "Group1", "Group1", "Group1", 
    "Group1", "Group1"), time = c(6L, 6L, 8L, 8L, 8L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 8L, 8L), time_unit = c("Weeks", 
    "Weeks", "Weeks", "Weeks", "Weeks", "weeks", "weeks", "weeks", 
    "weeks", "weeks", "weeks", "weeks", "weeks", "weeks", "weeks", 
    "weeks", "weeks", "weeks", "Weeks"), match2 = c(NA, "Group1", 
    NA, NA, "Group1", "Group1", "Group2", "Group1", "Group3", 
    "Group2", "Group1", "Group2", "Group1", "Group2", "Group1", 
    "Group1", "Group1", "Group1", "Group1"), n = 1:19), row.names = c(NA, 
-19L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000019a9a14a3e0>)

I've tried different approaches and couldn't find anything similar online. How do I match but also create a sort of dictionary/list on what is allowed to be matched?

Upvotes: 0

Views: 127

Answers (1)

nd37255
nd37255

Reputation: 368

I would probably use a case_when to have columns match on arbitrary values. However, I'm not sure what effect the grouping is having here, so I'm not sure I fully understand what you're trying to do. Maybe this will help get you started

library(tidyverse)

df <- tibble(
  ID = c(
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L
  ), name = c(
    "name1", "name1",
    "name1", "name1", "name1", "name2", "name2", "name2", "name2",
    "name2", "name2", "name2", "name2", "name2", "name3", "name3",
    "name3", "name3", "name3"
  ), var1 = c(
    "trt1", "trt2", "trt3",
    "trt1", "trt2", "trt4", "trt4", "trt5", "trt5", "trt5", "trt4",
    "trt4", "trt5", "trt5", "trt6", "trt2", "trt6", "trt2", "trt6"
  ), var2 = c(
    "Flexible", "Flexible", "Flexible", "Flexible", "Flexible",
    "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed",
    "Fixed", "Fixed", "Flexible", "Flexible", "Flexible", "Flexible",
    "Flexible"
  ), var3 = c(
    10, NA, 12.5, 10, NA, 10, 10, 20, 20, 20,
    10, 10, 20, 20, 10, NA, 10, NA, 10
  ), var4 = c(
    40L, NA, 50L, 40L,
    NA, 10L, 10L, 20L, 20L, 20L, 10L, 10L, 20L, 20L, 40L, NA, 40L,
    NA, 40L
  ), var5 = c(
    "mg", NA, "mg", "mg", NA, "mg", "mg", "mg",
    "mg", "mg", "mg", "mg", "mg", "mg", "mg", NA, "mg", NA, "mg"
  ),
  match1 = c(
    "Group1", "Group1", "Group1", "Group1", "Group1",
    "Group1", "Group1", "Group1", "Group1", "Group1", "Group2",
    "Group2", "Group2", "Group2", "Group1", "Group1", "Group1",
    "Group1", "Group1"
  ), time = c(
    6L, 6L, 8L, 8L, 8L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 8L
  ), time_unit = c(
    "Weeks",
    "Weeks", "Weeks", "Weeks", "Weeks", "weeks", "weeks", "weeks",
    "weeks", "weeks", "weeks", "weeks", "weeks", "weeks", "weeks",
    "weeks", "weeks", "weeks", "Weeks"
  ), match2 = c(
    NA, "Group1",
    NA, NA, "Group1", "Group1", "Group2", "Group1", "Group3",
    "Group2", "Group1", "Group2", "Group1", "Group2", "Group1",
    "Group1", "Group1", "Group1", "Group1"
  ), n = 1:19
)

df %>% 
    group_by(across(.cols = c(ID, name, starts_with("var"), time, time_unit))) %>% 
    mutate(custom_filter = case_when(
        match1 == match2 ~ 1,
        is.na(match1) | is.na(match2) ~ 1,
        match1 == "Group1" & match2 == "Group3" ~ 1,
        TRUE ~ 0))
#> # A tibble: 19 x 13
#> # Groups:   ID, name, var1, var2, var3, var4, var5, time, time_unit [10]
#>       ID name  var1  var2     var3  var4 var5  match1  time time_~1 match2     n
#>    <int> <chr> <chr> <chr>   <dbl> <int> <chr> <chr>  <int> <chr>   <chr>  <int>
#>  1     1 name1 trt1  Flexib~  10      40 mg    Group1     6 Weeks   <NA>       1
#>  2     1 name1 trt2  Flexib~  NA      NA <NA>  Group1     6 Weeks   Group1     2
#>  3     1 name1 trt3  Flexib~  12.5    50 mg    Group1     8 Weeks   <NA>       3
#>  4     1 name1 trt1  Flexib~  10      40 mg    Group1     8 Weeks   <NA>       4
#>  5     1 name1 trt2  Flexib~  NA      NA <NA>  Group1     8 Weeks   Group1     5
#>  6     2 name2 trt4  Fixed    10      10 mg    Group1     0 weeks   Group1     6
#>  7     2 name2 trt4  Fixed    10      10 mg    Group1     0 weeks   Group2     7
#>  8     2 name2 trt5  Fixed    20      20 mg    Group1     0 weeks   Group1     8
#>  9     2 name2 trt5  Fixed    20      20 mg    Group1     0 weeks   Group3     9
#> 10     2 name2 trt5  Fixed    20      20 mg    Group1     0 weeks   Group2    10
#> 11     2 name2 trt4  Fixed    10      10 mg    Group2     0 weeks   Group1    11
#> 12     2 name2 trt4  Fixed    10      10 mg    Group2     0 weeks   Group2    12
#> 13     2 name2 trt5  Fixed    20      20 mg    Group2     0 weeks   Group1    13
#> 14     2 name2 trt5  Fixed    20      20 mg    Group2     0 weeks   Group2    14
#> 15     3 name3 trt6  Flexib~  10      40 mg    Group1     0 weeks   Group1    15
#> 16     3 name3 trt2  Flexib~  NA      NA <NA>  Group1     0 weeks   Group1    16
#> 17     3 name3 trt6  Flexib~  10      40 mg    Group1     0 weeks   Group1    17
#> 18     3 name3 trt2  Flexib~  NA      NA <NA>  Group1     0 weeks   Group1    18
#> 19     3 name3 trt6  Flexib~  10      40 mg    Group1     8 Weeks   Group1    19
#> # ... with 1 more variable: `case_when(...)` <dbl>, and abbreviated variable
#> #   name 1: time_unit

Created on 2022-11-23 with reprex v2.0.2

Upvotes: 1

Related Questions