dplyr how to summarize and split result from a function that returns a vector

Question

Suppose I need to summarize by gene from this data.frame:

g1 = data.frame ( 
      gene = c( "a","a","a","a","b"),
      value = c(1,200,3,5,0)
    )
  gene value
1    a     1
2    a   200
3    a     3
4    a     5
5    b     0

What I want to do is aggregate by gene, but using a function that returns two values. For this example, let's say the function returns a median and a mean.

mn <- function ( x ){
    return  ( c( median(x), mean(x) ))
}

Because the function returns a vector, I currently call it twice, once per element. Is there a way to split the result up so that I only have to calculate it once?

library(dplyr)

g1 %>%
    group_by(gene) %>%
    dplyr::summarize(
        median = mn(value)[1],  # mn() returns a vector, so it is called once per element
        mean = mn(value)[2]
    ) %>%
    data.frame()

Chase · Accepted Answer

You can do this with dplyr, though it's not necessarily as intuitive as other solutions: the do() function will work. Note that I modified your mn() function to assign names to the vector it returns.

The reference page for do() has the details. The tricky part is how you pass in the data with the .$ notation: inside do(), . stands for the data frame of the current group, so .$value is that group's value column.
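To make the . placeholder concrete, here is a minimal sketch that only counts rows and sums values per group. The column names n and total are made up for illustration and are not part of the original answer.

```r
library(dplyr)

g1 <- data.frame(
  gene = c("a", "a", "a", "a", "b"),
  value = c(1, 200, 3, 5, 0)
)

# Inside do(), `.` is the data frame for the current group,
# so nrow(.) counts that group's rows and .$value is its value column.
group_sizes <- g1 %>%
  group_by(gene) %>%
  do(data.frame(n = nrow(.), total = sum(.$value))) %>%
  data.frame()

print(group_sizes)
```

Each call to do() must return a data frame; the per-group results are then row-bound, with the grouping column added back.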

library(dplyr)
g1 = data.frame ( 
  gene = c( "a","a","a","a","b"),
  value = c(1,200,3,5,0)
)

mn <- function (x){
  return(c(median = median(x), mean = mean(x)))
}


g1 %>% group_by(gene) %>% 
  do(data.frame(t(mn(.$value)))) %>%
  data.frame()
#>   gene median  mean
#> 1    a      4 52.25
#> 2    b      0  0.00

Created on 2019-01-11 by the reprex package (v0.2.1)
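As a possible alternative, and assuming dplyr 1.0.0 or later is installed (newer than the version this answer was written against), summarise() can unpack an unnamed data frame result into columns, so mn() is still called only once per group. This is a sketch, not part of the original answer.

```r
library(dplyr)

g1 <- data.frame(
  gene = c("a", "a", "a", "a", "b"),
  value = c(1, 200, 3, 5, 0)
)

mn <- function(x) {
  c(median = median(x), mean = mean(x))
}

# as.list() keeps the names, so the one-row data frame has columns
# `median` and `mean`. Because the expression is unnamed, summarise()
# (dplyr >= 1.0.0 assumed) spreads those columns into the result.
res <- g1 %>%
  group_by(gene) %>%
  summarise(as.data.frame(as.list(mn(value)))) %>%
  data.frame()

print(res)
```

The do() version above remains the one that works on older dplyr releases.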

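Without going into a deep comparison of data.table and dplyr, here's a timing comparison between the data.table approach and the do() solution on a moderately sized chunk of data. Here mn() returns a list, so that data.table turns each element into its own column: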

library(data.table)
library(dplyr)
#function
mn <- function (x){
  return(list(median = median(x), mean = mean(x)))
}

#bigger data
g1 = data.frame( 
  gene = gl(1e5, 1e2),
  value = rnorm(1e8)
)

f_dt <- function() setDT(g1)[, mn(value), by = gene]
f_dp <- function() g1 %>% group_by(gene) %>% do(data.frame(t(mn(.$value)))) %>% data.frame()

system.time(f_dt())
#>    user  system elapsed 
#>   11.00    1.53   15.35
system.time(f_dp())
#>    user  system elapsed 
#>   38.09    0.37   39.94

Created on 2019-01-11 by the reprex package (v0.2.1)
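Before timing on the full data, it can help to check on a small sample that both approaches return the same numbers. The sketch below is an assumption-laden variation, not the benchmark above: the object names small and small_dt are hypothetical, the data is much smaller, and as.data.table() is used instead of setDT() because setDT() converts its argument by reference, which would hand dplyr a data.table rather than a plain data.frame.

```r
library(data.table)
library(dplyr)

mn <- function(x) {
  list(median = median(x), mean = mean(x))
}

set.seed(1)
small <- data.frame(
  gene = gl(1e3, 1e2),   # 1,000 genes, 100 rows each
  value = rnorm(1e5)
)

# as.data.table() makes a copy, so `small` stays a plain data.frame.
small_dt <- as.data.table(small)

# data.table: a list returned in j becomes one column per element.
res_dt <- small_dt[, mn(value), by = gene]

# dplyr: data.frame() turns the named list into a one-row data frame.
res_dp <- small %>%
  group_by(gene) %>%
  do(data.frame(mn(.$value))) %>%
  data.frame()

# Both results should agree group by group.
print(all.equal(res_dt$median, res_dp$median))
print(all.equal(res_dt$mean, res_dp$mean))

# Timings will vary by machine and package version.
print(system.time(small_dt[, mn(value), by = gene]))
print(system.time(small %>% group_by(gene) %>% do(data.frame(mn(.$value)))))
```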
