R: split on one column, apply function on each group and display all columns from each group in the output

Question

I have a data set like this:

  > data
   seq desc  id      sample1 sample2 sample3
1 atgc  pqr 123 1.000000e+00       1       1
2 atgc  pqr 123 2.000000e+00       2       2
3  atg   pq  12 1.000000e+00       1       1
4 atgc  pqr 123 3.000000e+00       3       3
5  atg   pq  12 2.000000e+00       2       2
6  atg   pq  12 7.757019e-05       3       3
7  atg   pq  12 1.402031e-05       3       3

I want to split the data on the 'seq' column and calculate the median of each sample column within each group. I also want the desc and id columns to appear in the output, which should look something like this:

   seq desc  id   sample1 sample2 sample3
1  atg   pq  12 0.5000388     2.5     2.5
2 atgc  pqr 123 2.0000000     2.0     2.0

I have tried a split and lapply combination, which gives:

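# assuming split_data was created by splitting on the 'seq' column
split_data <- split(data, data$seq)
lapply(split_data, function(x) apply(x[, c(4, 5, 6)], 2, median))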
$atg
  sample1   sample2   sample3 
0.5000388 2.5000000 2.5000000 

$atgc
sample1 sample2 sample3 
      2       2       2

With ddply (from the plyr package):

library(plyr)
ddply(data, .(seq), function(x) apply(x[, c(4, 5, 6)], 2, median))
   seq   sample1 sample2 sample3
1  atg 0.5000388     2.5     2.5
2 atgc 2.0000000     2.0     2.0

Is there a way to include the desc and id columns from each group in the final data frame, to get the output shown above?

akrun · Accepted Answer

With ddply, you can add desc and id to the grouping variables and use colwise to apply median to every remaining column:

library(plyr)
ddply(data, .(seq, desc, id), colwise(median))
#    seq desc  id   sample1 sample2 sample3
#1  atg   pq  12 0.5000388     2.5     2.5
#2 atgc  pqr 123 2.0000000     2.0     2.0

Using aggregate from base R, where the `.` on the left of the formula stands for all columns not named on the right:

aggregate(.~seq+desc+id, data, median)
#   seq desc  id   sample1 sample2 sample3
#1  atg   pq  12 0.5000388     2.5     2.5
#2 atgc  pqr 123 2.0000000     2.0     2.0
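
Grouping by seq, desc and id gives the same groups as grouping by seq alone only because desc and id are constant within each seq in this data. The following is a minimal base R sketch of how one might check that, and of an alternative that computes the medians per seq and then attaches the descriptive columns. The data frame is rebuilt from the question, assuming sample2 and sample3 are integer columns; the names keys, meds, result and constant_within_seq are illustrative and not part of the original answer.

```r
# Rebuild the example data (sample2/sample3 are assumed to be integers).
data <- data.frame(
  seq     = c("atgc", "atgc", "atg", "atgc", "atg", "atg", "atg"),
  desc    = c("pqr", "pqr", "pq", "pqr", "pq", "pq", "pq"),
  id      = c(123, 123, 12, 123, 12, 12, 12),
  sample1 = c(1, 2, 1, 3, 2, 7.757019e-05, 1.402031e-05),
  sample2 = c(1L, 2L, 1L, 3L, 2L, 3L, 3L),
  sample3 = c(1L, 2L, 1L, 3L, 2L, 3L, 3L)
)

# One row per distinct (seq, desc, id) combination.
keys <- unique(data[, c("seq", "desc", "id")])

# TRUE when desc and id never vary within a seq group.
constant_within_seq <- nrow(keys) == length(unique(data$seq))

# Medians per seq only, then attach the descriptive columns by seq.
meds <- aggregate(cbind(sample1, sample2, sample3) ~ seq,
                  data = data, FUN = median)
result <- merge(keys, meds, by = "seq")
print(result)
```

If constant_within_seq were FALSE, the merge would return more than one row per seq, which is a signal that desc and id should not simply be added to the grouping variables.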

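A similar option with data.table. The 'sample' columns are first converted to numeric so that they have the same class as in the expected output: median of an integer column can return an integer for one group and a double for another, while data.table requires each result column to have one consistent type across groups.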

library(data.table)
setDT(data)[, 4:6 := lapply(.SD, as.numeric), .SDcols=4:6][,
                            lapply(.SD, median), .(seq, desc, id)]
#    seq desc  id   sample1 sample2 sample3
#1: atgc  pqr 123 2.0000000     2.0     2.0
#2:  atg   pq  12 0.5000388     2.5     2.5
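
The reason for the as.numeric conversion can be seen in base R alone. This small sketch assumes sample2 is stored as integer, as the answer implies; odd_group and even_group are illustrative names holding the sample2 values of the two groups.

```r
odd_group  <- c(1L, 2L, 3L)       # sample2 values for the atgc group
even_group <- c(1L, 2L, 3L, 3L)   # sample2 values for the atg group

# For an odd-length integer vector, median returns the middle element,
# so the result stays an integer.
class(median(odd_group))    # "integer"

# For an even-length integer vector, median averages the two middle
# elements, so the result is a double.
class(median(even_group))   # "numeric"

# After converting to numeric, both groups return the same type.
class(median(as.numeric(odd_group)))    # "numeric"
class(median(as.numeric(even_group)))   # "numeric"
```

With three rows in one group and four in the other, the per-group medians of an integer column would differ in type, which is presumably what the conversion in the answer avoids.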
