Reputation: 4949
Suppose I need to summarize by gene from this data.frame:
g1 = data.frame (
gene = c( "a","a","a","a","b"),
value = c(1,200,3,5,0)
)
  gene value
1    a     1
2    a   200
3    a     3
4    a     5
5    b     0
What I want to do is aggregate by gene, but using a function that returns two values. For this example, let's say the function returns the median and the mean.
mn <- function ( x ){
return ( c( median(x), mean(x) ))
}
Because the function returns a vector, I currently have to call it twice, once for each element. Is there a way to split the result up so that I only have to calculate it once?
g1 %>%
group_by(gene) %>%
dplyr::summarize(
median = mn ( value )[1], # because mn returns a vector I need to call it twice
mean = mn ( value )[2]
) %>%
data.frame()
Upvotes: 1
Views: 315
Reputation: 69171
You can do this with dplyr, though it's not necessarily as intuitive as other solutions. The do() function will work, however. NOTE: I modified your mn() function to assign names to the vector that is returned.
Here's the reference page for do(). The tricky part is how you pass in the object with the .$ notation.
library(dplyr)
g1 = data.frame (
gene = c( "a","a","a","a","b"),
value = c(1,200,3,5,0)
)
mn <- function (x){
return(c(median = median(x), mean = mean(x)))
}
g1 %>% group_by(gene) %>%
do(data.frame(t(mn(.$value)))) %>%
data.frame()
#>   gene median  mean
#> 1    a      4 52.25
#> 2    b      0  0.00
Created on 2019-01-11 by the reprex package (v0.2.1)
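To make the reshaping step concrete, here is a minimal sketch in base R of what happens inside do() for a single group. It assumes the named-vector version of mn() from above; the variable names x, res, and row are only for illustration.

```r
# Named-vector version of mn(), as in the answer above
mn <- function(x) {
  c(median = median(x), mean = mean(x))
}

# Values for gene "a" in the example data
x <- c(1, 200, 3, 5)

res <- mn(x)               # named numeric vector of length 2
row <- data.frame(t(res))  # t() makes a 1 x 2 matrix; data.frame() keeps the names as columns

print(row)
```

Inside do(), the `.` stands for the data of the current group, so `.$value` plays the role of x here, and the one-row data frames from all groups are bound together into the final result.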
Without diverting into a deep dive on data.table versus dplyr, here's a timing comparison between the two solutions on a moderately sized chunk of data:
library(data.table)
library(dplyr)
#function
mn <- function (x){
return(list(median = median(x), mean = mean(x)))
}
#bigger data
g1 = data.frame(
gene = gl(1e5, 1e2),
value = rnorm(1e8)
)
# Note: setDT() converts g1 to a data.table by reference, so later calls also see a data.table
f_dt <- function() setDT(g1)[, mn(value), by = gene]
f_dp <- function() g1 %>% group_by(gene) %>% do(data.frame(t(mn(.$value)))) %>% data.frame()
system.time(f_dt())
#>    user  system elapsed
#>   11.00    1.53   15.35
system.time(f_dp())
#>    user  system elapsed
#>   38.09    0.37   39.94
Created on 2019-01-11 by the reprex package (v0.2.1)
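As a side note for newer setups: assuming dplyr 1.0.0 or later (released after this answer was written), summarise() can unpack an unnamed expression that returns a one-row data frame into several columns, so mn() is called only once per group. This is a sketch under that version assumption, not a benchmarked replacement for the code above; the name res is only for illustration.

```r
library(dplyr)

g1 <- data.frame(
  gene  = c("a", "a", "a", "a", "b"),
  value = c(1, 200, 3, 5, 0)
)

# Named-vector version of mn()
mn <- function(x) {
  c(median = median(x), mean = mean(x))
}

res <- g1 %>%
  group_by(gene) %>%
  # Assumes dplyr >= 1.0.0: an unnamed one-row data frame is spread into columns
  summarise(as.data.frame(t(mn(value)))) %>%
  data.frame()

print(res)
```

On older dplyr versions this form of summarise() is not expected to work, which is why the answer above uses do().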
Upvotes: 1
Reputation: 3223
You can't do that with dplyr, but you can do it with data.table:
library(data.table)
g1 = data.table (
gene = c( "a","a","a","a","b"),
value = c(1,200,3,5,0))
mn <- function(x){
return(list(med = median(x), mean = mean(x)))
}
g1[, mn(value), by = gene]
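A short sketch of why this works: when the j expression in a data.table evaluates to a list, each list element becomes a column of the result, and the expression is evaluated once per by group. The example below repeats the code above and stores the result so it can be inspected; the name res is only for illustration.

```r
library(data.table)

g1 <- data.table(
  gene  = c("a", "a", "a", "a", "b"),
  value = c(1, 200, 3, 5, 0)
)

# Returns a list, so each element becomes its own column
mn <- function(x) {
  list(med = median(x), mean = mean(x))
}

# mn() runs once per gene; the result has columns gene, med, and mean
res <- g1[, mn(value), by = gene]
print(res)
```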
Upvotes: 1