Reputation: 879
I have a data frame with various variables. My aim is to essentially filter out rows where match1 and match2 don't meet my criteria but grouping for other variables. For example, in the data frame below I want to group by ID,name, all vars, time and time_unit. I've created a column n to show more clearly the difference between the expected output and the data.
When grouping by these you can see that there are incorrect duplicates in the columns match1 and match2, i.e. at row n = 7,10,11,13 because the columns match1 and match2 are not the same but the other columns are. I also want to include rows even if one of the match columns contains NA. Note: row 9 I want group1 to also match group3.
I've tried different approaches and couldn't find anything similar online. How do I match but also create a sort of dictionary/list on what is allowed to be matched? I.e. I would want to keep rows if match1 contains group1 and match2 contains group2 but also group3 and vice versa. My real data frame contains many more rows and I have changed the data.
The dataframe:
ID name var1 var2 var3 var4 var5 match1 time time_unit match2 n
1: 1 name1 trt1 Flexible 10.0 40 mg Group1 6 Weeks <NA> 1
2: 1 name1 trt2 Flexible NA NA <NA> Group1 6 Weeks Group1 2
3: 1 name1 trt3 Flexible 12.5 50 mg Group1 8 Weeks <NA> 3
4: 1 name1 trt1 Flexible 10.0 40 mg Group1 8 Weeks <NA> 4
5: 1 name1 trt2 Flexible NA NA <NA> Group1 8 Weeks Group1 5
6: 2 name2 trt4 Fixed 10.0 10 mg Group1 0 weeks Group1 6
7: 2 name2 trt4 Fixed 10.0 10 mg Group1 0 weeks Group2 7
8: 2 name2 trt5 Fixed 20.0 20 mg Group1 0 weeks Group1 8
9: 2 name2 trt5 Fixed 20.0 20 mg Group1 0 weeks Group3 9
10: 2 name2 trt5 Fixed 20.0 20 mg Group1 0 weeks Group2 10
11: 2 name2 trt4 Fixed 10.0 10 mg Group2 0 weeks Group1 11
12: 2 name2 trt4 Fixed 10.0 10 mg Group2 0 weeks Group2 12
13: 2 name2 trt5 Fixed 20.0 20 mg Group2 0 weeks Group1 13
14: 2 name2 trt5 Fixed 20.0 20 mg Group2 0 weeks Group2 14
15: 3 name3 trt6 Flexible 10.0 40 mg Group1 0 weeks Group1 15
16: 3 name3 trt2 Flexible NA NA <NA> Group1 0 weeks Group1 16
17: 3 name3 trt6 Flexible 10.0 40 mg Group1 8 Weeks Group1 17
18: 3 name3 trt2 Flexible NA NA <NA> Group1 8 weeks Group1 18
Expected output:
ID name var1 var2 var3 var4 var5 match1 time time_unit match2 n
1: 1 name1 trt1 Flexible 10.0 40 mg Group1 6 Weeks <NA> 1
2: 1 name1 trt2 Flexible NA NA <NA> Group1 6 Weeks Group1 2
3: 1 name1 trt3 Flexible 12.5 50 mg Group1 8 Weeks <NA> 3
4: 1 name1 trt1 Flexible 10.0 40 mg Group1 8 Weeks <NA> 4
5: 1 name1 trt2 Flexible NA NA <NA> Group1 8 Weeks Group1 5
6: 2 name2 trt4 Fixed 10.0 10 mg Group1 0 weeks Group1 6
7: 2 name2 trt5 Fixed 20.0 20 mg Group1 0 weeks Group1 8
8: 2 name2 trt5 Fixed 20.0 20 mg Group1 0 weeks Group3 9
9: 2 name2 trt4 Fixed 10.0 10 mg Group2 0 weeks Group2 12
10: 2 name2 trt5 Fixed 20.0 20 mg Group2 0 weeks Group2 14
11: 3 name3 trt6 Flexible 10.0 40 mg Group1 0 weeks Group1 15
12: 3 name3 trt2 Flexible NA NA <NA> Group1 0 weeks Group1 16
13: 3 name3 trt6 Flexible 10.0 40 mg Group1 8 weeks Group1 17
14: 3 name3 trt2 Flexible NA NA <NA> Group1 8 weeks Group1 18
This is the dput data:
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), name = c("name1", "name1",
"name1", "name1", "name1", "name2", "name2", "name2", "name2",
"name2", "name2", "name2", "name2", "name2", "name3", "name3",
"name3", "name3", "name3"), var1 = c("trt1", "trt2", "trt3",
"trt1", "trt2", "trt4", "trt4", "trt5", "trt5", "trt5", "trt4",
"trt4", "trt5", "trt5", "trt6", "trt2", "trt6", "trt2", "trt6"
), var2 = c("Flexible", "Flexible", "Flexible", "Flexible", "Flexible",
"Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed",
"Fixed", "Fixed", "Flexible", "Flexible", "Flexible", "Flexible",
"Flexible"), var3 = c(10, NA, 12.5, 10, NA, 10, 10, 20, 20, 20,
10, 10, 20, 20, 10, NA, 10, NA, 10), var4 = c(40L, NA, 50L, 40L,
NA, 10L, 10L, 20L, 20L, 20L, 10L, 10L, 20L, 20L, 40L, NA, 40L,
NA, 40L), var5 = c("mg", NA, "mg", "mg", NA, "mg", "mg", "mg",
"mg", "mg", "mg", "mg", "mg", "mg", "mg", NA, "mg", NA, "mg"),
match1 = c("Group1", "Group1", "Group1", "Group1", "Group1",
"Group1", "Group1", "Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group2", "Group1", "Group1", "Group1",
"Group1", "Group1"), time = c(6L, 6L, 8L, 8L, 8L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 8L, 8L), time_unit = c("Weeks",
"Weeks", "Weeks", "Weeks", "Weeks", "weeks", "weeks", "weeks",
"weeks", "weeks", "weeks", "weeks", "weeks", "weeks", "weeks",
"weeks", "weeks", "weeks", "Weeks"), match2 = c(NA, "Group1",
NA, NA, "Group1", "Group1", "Group2", "Group1", "Group3",
"Group2", "Group1", "Group2", "Group1", "Group2", "Group1",
"Group1", "Group1", "Group1", "Group1"), n = 1:19), row.names = c(NA,
-19L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000019a9a14a3e0>)
I've tried different approaches and couldn't find anything similar online. How do I match but also create a sort of dictionary/list on what is allowed to be matched?
Upvotes: 0
Views: 127
Reputation: 368
I would probably use a case_when to have columns match on arbitrary values. However, I'm not sure what effect the grouping is having here, so I'm not sure I fully understand what you're trying to do. Maybe this will help get you started
library(tidyverse)
df <- tibble(
ID = c(
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L
), name = c(
"name1", "name1",
"name1", "name1", "name1", "name2", "name2", "name2", "name2",
"name2", "name2", "name2", "name2", "name2", "name3", "name3",
"name3", "name3", "name3"
), var1 = c(
"trt1", "trt2", "trt3",
"trt1", "trt2", "trt4", "trt4", "trt5", "trt5", "trt5", "trt4",
"trt4", "trt5", "trt5", "trt6", "trt2", "trt6", "trt2", "trt6"
), var2 = c(
"Flexible", "Flexible", "Flexible", "Flexible", "Flexible",
"Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed", "Fixed",
"Fixed", "Fixed", "Flexible", "Flexible", "Flexible", "Flexible",
"Flexible"
), var3 = c(
10, NA, 12.5, 10, NA, 10, 10, 20, 20, 20,
10, 10, 20, 20, 10, NA, 10, NA, 10
), var4 = c(
40L, NA, 50L, 40L,
NA, 10L, 10L, 20L, 20L, 20L, 10L, 10L, 20L, 20L, 40L, NA, 40L,
NA, 40L
), var5 = c(
"mg", NA, "mg", "mg", NA, "mg", "mg", "mg",
"mg", "mg", "mg", "mg", "mg", "mg", "mg", NA, "mg", NA, "mg"
),
match1 = c(
"Group1", "Group1", "Group1", "Group1", "Group1",
"Group1", "Group1", "Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group2", "Group1", "Group1", "Group1",
"Group1", "Group1"
), time = c(
6L, 6L, 8L, 8L, 8L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 8L
), time_unit = c(
"Weeks",
"Weeks", "Weeks", "Weeks", "Weeks", "weeks", "weeks", "weeks",
"weeks", "weeks", "weeks", "weeks", "weeks", "weeks", "weeks",
"weeks", "weeks", "weeks", "Weeks"
), match2 = c(
NA, "Group1",
NA, NA, "Group1", "Group1", "Group2", "Group1", "Group3",
"Group2", "Group1", "Group2", "Group1", "Group2", "Group1",
"Group1", "Group1", "Group1", "Group1"
), n = 1:19
)
df %>%
group_by(across(.cols = c(ID, name, starts_with("var"), time, time_unit))) %>%
mutate(custom_filter = case_when(
match1 == match2 ~ 1,
is.na(match1) | is.na(match2) ~ 1,
match1 == "Group1" & match2 == "Group3" ~ 1,
TRUE ~ 0))
#> # A tibble: 19 x 13
#> # Groups: ID, name, var1, var2, var3, var4, var5, time, time_unit [10]
#> ID name var1 var2 var3 var4 var5 match1 time time_~1 match2 n
#> <int> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <int> <chr> <chr> <int>
#> 1 1 name1 trt1 Flexib~ 10 40 mg Group1 6 Weeks <NA> 1
#> 2 1 name1 trt2 Flexib~ NA NA <NA> Group1 6 Weeks Group1 2
#> 3 1 name1 trt3 Flexib~ 12.5 50 mg Group1 8 Weeks <NA> 3
#> 4 1 name1 trt1 Flexib~ 10 40 mg Group1 8 Weeks <NA> 4
#> 5 1 name1 trt2 Flexib~ NA NA <NA> Group1 8 Weeks Group1 5
#> 6 2 name2 trt4 Fixed 10 10 mg Group1 0 weeks Group1 6
#> 7 2 name2 trt4 Fixed 10 10 mg Group1 0 weeks Group2 7
#> 8 2 name2 trt5 Fixed 20 20 mg Group1 0 weeks Group1 8
#> 9 2 name2 trt5 Fixed 20 20 mg Group1 0 weeks Group3 9
#> 10 2 name2 trt5 Fixed 20 20 mg Group1 0 weeks Group2 10
#> 11 2 name2 trt4 Fixed 10 10 mg Group2 0 weeks Group1 11
#> 12 2 name2 trt4 Fixed 10 10 mg Group2 0 weeks Group2 12
#> 13 2 name2 trt5 Fixed 20 20 mg Group2 0 weeks Group1 13
#> 14 2 name2 trt5 Fixed 20 20 mg Group2 0 weeks Group2 14
#> 15 3 name3 trt6 Flexib~ 10 40 mg Group1 0 weeks Group1 15
#> 16 3 name3 trt2 Flexib~ NA NA <NA> Group1 0 weeks Group1 16
#> 17 3 name3 trt6 Flexib~ 10 40 mg Group1 0 weeks Group1 17
#> 18 3 name3 trt2 Flexib~ NA NA <NA> Group1 0 weeks Group1 18
#> 19 3 name3 trt6 Flexib~ 10 40 mg Group1 8 Weeks Group1 19
#> # ... with 1 more variable: `case_when(...)` <dbl>, and abbreviated variable
#> # name 1: time_unit
Created on 2022-11-23 with reprex v2.0.2
Upvotes: 1