Using purrr to iterate over two lists and then pipe into dplyr::filter

Question

library(tidyverse)
library(purrr)

Using the sample data below, I can create the following function:

Funs <- function(DF, One, Two){

    One <- enquo(One)
    Two <- enquo(Two)

    DF %>% filter(School == (!!One) & Code == (!!Two)) %>%
        group_by(Code, School) %>%
        summarise(Count = sum(Question1))
}

I can then use the function to filter on two variables - School and Code - like this:

Funs(DF, "School1", "B344")

That's all good, but my actual data has many variables and therefore instead of having to constantly type the "School" and "Code" variables into the function, I would like to use tidyverse and the purrr package to loop over two lists (one of School, and one for Code) and feed this into filter. I would like the output to be a list of results.

For the sake of simplicity, the two lists to feed into dplyr::filter will only have two values each: School2 will go with S300, and School1 with B344, just like the example above.

Some examples that I've tried:

map2(c(“School2”, ”School1”),
     c(“S300”, ”B344”),
     function(x,y) {
         DF %>% filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1))
     }

Also...

map2(c("School2", "School1")),
     c("S300","B344"),
     ~filter(School == .x & Code == .y) %>%
         group_by(Code, School)%>%
         summarise(Count = sum(Question1))

And this...

list(c("School2", "School1"), c("S300", "B344")) %>%
    map2( ~ filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1)))

None of these seem to work, so help would be appreciated!

Sample data:

Code <- c("B344","B555","S300","T220","B888","B888","B555","B344","B344","T220","B555","B555","S300","B555","S300","S300","S300","S300","B344","B344","B888","B888","B888")
School <- c("School1","School1","School2","School3","School4","School4","School1","School1","School3","School3","School4","School1","School1","School3","School2","School2","School4","School2","School3","School4","School3","School1","School2")
Question1 <- c(3,4,5,4,5,5,5,4,5,3,4,5,4,5,4,3,3,3,4,5,4,3,3)
Question2 <- c(5,4,3,4,3,5,4,3,2,3,4,5,4,5,4,3,4,4,5,4,3,3,4)
DF <- data_frame(Code, School, Question1, Question2)

alistaire · Accepted Answer

Here's some options, from most like your code to most optimal:

library(tidyverse)

DF <- data_frame(Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344", "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300", "S300", "S300", "B344", "B344", "B888", "B888", "B888"), 
                 School = c("School1", "School1", "School2", "School3", "School4", "School4", "School1", "School1", "School3", "School3", "School4", "School1", "School1", "School3", "School2", "School2", "School4", "School2", "School3", "School4", "School3", "School1", "School2"), 
                 Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3), 
                 Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4))

wanted <- data_frame(School = c("School2", "School1"),
                     Code = c("S300", "B344"))

To get map2 to work, if using tilde notation, the variables are named .x and .y; if you use regular function notation, you can call them whatever you like. Don't forget that the first parameter of filter is the data frame piped in, so:

map2_dfr(wanted$School, wanted$Code, ~filter(DF, School == .x, Code == .y)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>                
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

Since I set up wanted as a data frame (a vanilla list would work too), you can use pmap instead. Parameter names with pmap can actually be the same as map2 for two variables, but it's really a function with ... for its parameters, so it often makes sense to handle them differently, e.g. with ..1 notation:

wanted %>% 
    pmap_dfr(~filter(DF, School == ..1, Code == ..2)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>                
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

The problem with both of the above techniques is that at scale, they will be slow, because they are running filter for every row of wanted, meaning you're retesting each row many times. To keep the code similar, a slightly hacky way to avoid the extra work is to combine the columns into one, e.g. with tidyr::unite:

DF %>% 
    unite(school_code, School, Code) %>% 
    filter(school_code %in% invoke(paste, wanted, sep = '_')) %>%    # or paste(wanted$School, wanted$Code, sep = '_') or equivalent
    separate(school_code, c('School', 'Code')) %>%
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>                
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

...or just combine them within filter itself:

DF %>% 
    filter(paste(School, Code) %in% paste(wanted$School, wanted$Code)) %>%    # or invoke(paste, wanted)
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>                
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

The best way to get your desired result is perhaps more obvious now that I've set up wanted as a data frame: a join, which is designed to do precisely this job:

DF %>% 
    inner_join(wanted) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> Joining, by = c("Code", "School")
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>                
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

Using purrr to iterate over two lists and then pipe into dplyr::filter

Answers (1)

Related Questions