Reputation: 73
I have a data set like this:
> data
   seq desc  id      sample1 sample2 sample3
1 atgc  pqr 123 1.000000e+00       1       1
2 atgc  pqr 123 2.000000e+00       2       2
3  atg   pq  12 1.000000e+00       1       1
4 atgc  pqr 123 3.000000e+00       3       3
5  atg   pq  12 2.000000e+00       2       2
6  atg   pq  12 7.757019e-05       3       3
7  atg   pq  12 1.402031e-05       3       3
I want to split the data on the 'seq' column and calculate the median of every sample column within each group. I also want to keep the desc and id columns in the output. The output should look something like this:
   seq desc  id   sample1 sample2 sample3
1  atg   pq  12 0.5000388     2.5     2.5
2 atgc  pqr 123 2.0000000     2.0     2.0
I have tried a split and lapply combination, and the result is:
# split_data is presumably the result of split(data, data$seq)
lapply(split_data, function(x) apply(x[, c(4, 5, 6)], 2, median))
$atg
sample1 sample2 sample3
0.5000388 2.5000000 2.5000000
$atgc
sample1 sample2 sample3
2 2 2
With ddply,
ddply(data,.(seq),function(x)apply(x[,c(4,5,6)],2,median))
seq sample1 sample2 sample3
1 atg 0.5000388 2.5 2.5
2 atgc 2.0000000 2.0 2.0
Is there a way to include the desc and id columns from each group in the final data frame, so that the output looks like the one shown above?
Upvotes: 1
Views: 866
Reputation: 887118
With ddply you can use colwise:
library(plyr)
ddply(data, .(seq, desc, id), colwise(median))
# seq desc id sample1 sample2 sample3
#1 atg pq 12 0.5000388 2.5 2.5
#2 atgc pqr 123 2.0000000 2.0 2.0
Using aggregate from base R:
aggregate(.~seq+desc+id, data, median)
# seq desc id sample1 sample2 sample3
#1 atg pq 12 0.5000388 2.5 2.5
#2 atgc pqr 123 2.0000000 2.0 2.0
A similar option with data.table first needs the 'sample' columns converted to a common class (numeric here), matching the expected output:
library(data.table)
setDT(data)[, 4:6 := lapply(.SD, as.numeric), .SDcols=4:6][,
lapply(.SD, median), .(seq, desc, id)]
# seq desc id sample1 sample2 sample3
#1: atgc pqr 123 2.0000000 2.0 2.0
#2: atg pq 12 0.5000388 2.5 2.5
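As a side note on why the class conversion matters, here is a minimal base R sketch. It assumes that sample2 and sample3 were read in as integers, which the post does not show, so the vectors below are hypothetical stand-ins for one such column in each group. With an odd number of values median returns the middle element unchanged, and with an even number it averages the two middle ones, so the result type can differ between groups:

```r
# Hypothetical integer vectors mirroring sample2 in the two groups
atgc_vals <- c(1L, 2L, 3L)      # odd number of values
atg_vals  <- c(1L, 2L, 3L, 3L)  # even number of values

class(median(atgc_vals))  # "integer": the middle element is returned as is
class(median(atg_vals))   # "numeric": the two middle values are averaged
```

Converting the columns with as.numeric beforehand keeps the per-group results of one consistent type.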
Upvotes: 1
Reputation: 1123
Assuming desc and id don't vary within a group, you can do the following with dplyr:
library(dplyr)
# Note: summarise_each() and funs() are deprecated in recent dplyr releases
data %>%
  group_by(seq, id, desc) %>%
  summarise_each(funs(median))
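For newer dplyr versions, the same idea might be written with across. This is only a sketch, assuming dplyr 1.0.0 or later; the data frame is rebuilt here from the values in the question so the example runs on its own, and the sample columns are assumed to be numeric:

```r
library(dplyr)

# Data rebuilt from the question; column classes are an assumption
data <- data.frame(
  seq     = c("atgc", "atgc", "atg", "atgc", "atg", "atg", "atg"),
  desc    = c("pqr", "pqr", "pq", "pqr", "pq", "pq", "pq"),
  id      = c(123, 123, 12, 123, 12, 12, 12),
  sample1 = c(1, 2, 1, 3, 2, 7.757019e-05, 1.402031e-05),
  sample2 = c(1, 2, 1, 3, 2, 3, 3),
  sample3 = c(1, 2, 1, 3, 2, 3, 3)
)

# Group by the identifying columns, then take the median of every
# remaining column; across() skips the grouping columns automatically
result <- data %>%
  group_by(seq, desc, id) %>%
  summarise(across(everything(), median), .groups = "drop")

print(result)
```

The .groups = "drop" argument returns an ungrouped tibble, which avoids carrying the grouping into later steps.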
Upvotes: 4