Reputation: 529
I have the following data frame, my_data:
geneid chr acc_no start end size strand S1 S2 A1 A2
1 gene_010010 1 AC12345.1 3662 4663 1002 - 328 336 757 874
2 gene_010020 1 AC12345.1 5750 7411 1662 - 480 589 793 765
3 gene_010030 2 AC12345.1 9003 11024 2022 - 653 673 875 920
4 gene_010040 2 AC12345.1 12006 12566 561 - 573 623 483 430
5 gene_010050 3 AC12345.1 15035 17032 1998 - 2256 2333 1866 1944
6 gene_010060 3 AC12345.1 18188 18937 750 - 526 642 650 586
I am able to calculate sums for a single column, e.g.:
chr.sums <- data.frame(with (my_data, tapply(S1, INDEX=chr, FUN=sum)))
The problem is that I want chr.sums to have four columns (S1, S2, A1 and A2) and 30 rows, one per unique chr value. I would rather not switch back and forth to Python, but looping through the columns and assigning the output to specific columns of a data.frame baffles me.
EDIT: Toy data set above.
Upvotes: 1
Views: 1605
Reputation: 23758
tapply won't handle multiple columns, but the formula version of aggregate will.
chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
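As a quick check, here is a minimal sketch that rebuilds the toy data from the question and runs the aggregate call on it. Only the chr column and the four value columns are recreated; the other columns are left out because the formula does not use them.

```r
# Toy data rebuilt from the question (only the columns the formula uses).
my_data <- data.frame(
  chr = c(1, 1, 2, 2, 3, 3),
  S1  = c(328, 480, 653, 573, 2256, 526),
  S2  = c(336, 589, 673, 623, 2333, 642),
  A1  = c(757, 793, 875, 483, 1866, 650),
  A2  = c(874, 765, 920, 430, 1944, 586)
)

# cbind() on the left-hand side asks aggregate() to summarise several
# columns at once; chr on the right-hand side is the grouping variable.
chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
print(chr.sums)
#   chr   S1   S2   A1   A2
# 1   1  808  925 1550 1639
# 2   2 1226 1296 1358 1350
# 3   3 2782 2975 2516 2530
```

On the full data set the same call should give one row per unique chr value, so 30 rows in the case described in the question.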
Upvotes: 1
Reputation: 55685
You can use ddply from plyr. Here is some code:
library(plyr)  # attach plyr so that .() and summarize are found
plyr::ddply(my_data, .(chr), summarize, S1 = sum(S1), S2 = sum(S2),
            A1 = sum(A1), A2 = sum(A2))
EDIT. A more compact solution would be:
library(plyr)  # attach plyr so that .() and colwise() are found
plyr::ddply(my_data, .(chr), colwise(sum, .(S1, S2, A1, A2)))
Here is how it works. The data is first split into pieces based on chr. Then, the columns S1, S2, A1, A2 are summed up for each piece. Finally, they are assembled back into a single data frame.
Any time you have this kind of split-apply-combine problem, think of plyr as a solution.
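To make the three steps concrete, here is a rough base R sketch of the same split-apply-combine idea. It is not how plyr is implemented internally, only an illustration of the pattern; the helper names value_cols, pieces and sums are made up for this example, and the toy data is rebuilt from the question.

```r
# Toy data rebuilt from the question (only the columns needed here).
my_data <- data.frame(
  chr = c(1, 1, 2, 2, 3, 3),
  S1  = c(328, 480, 653, 573, 2256, 526),
  S2  = c(336, 589, 673, 623, 2333, 642),
  A1  = c(757, 793, 875, 483, 1866, 650),
  A2  = c(874, 765, 920, 430, 1944, 586)
)

value_cols <- c("S1", "S2", "A1", "A2")  # columns to sum (illustrative name)

# 1. Split: one small data frame per unique chr value.
pieces <- split(my_data[value_cols], my_data$chr)

# 2. Apply: sum every column within each piece.
sums <- lapply(pieces, colSums)

# 3. Combine: stack the per-piece results into one data frame and
#    restore chr as a column (taken from the list names, so it is text).
chr.sums <- data.frame(chr = names(sums), do.call(rbind, sums),
                       row.names = NULL)
print(chr.sums)
```

The ddply call above does the same three things in a single step, which is why it is convenient for this kind of problem.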
Upvotes: 4