darked89

Reputation: 529

R: looping through data.frame columns

I have the following data frame, my_data:

        geneid chr     acc_no start   end size strand   S1   S2   A1   A2
1 gene_010010   1 AC12345.1  3662  4663 1002      -  328  336  757  874
2 gene_010020   1 AC12345.1  5750  7411 1662      -  480  589  793  765
3 gene_010030   2 AC12345.1  9003 11024 2022      -  653  673  875  920
4 gene_010040   2 AC12345.1 12006 12566  561      -  573  623  483  430
5 gene_010050   3 AC12345.1 15035 17032 1998      - 2256 2333 1866 1944
6 gene_010060   3 AC12345.1 18188 18937  750      -  526  642  650  586

I am able to calculate sums for a single column, e.g.:

chr.sums <- data.frame(with (my_data, tapply(S1, INDEX=chr, FUN=sum)))

The problem is that I want chr.sums to have four columns (S1, S2, A1 and A2) and 30 rows, one per unique chr value. I would rather not switch back and forth to Python, but looping through the columns and assigning the output to specific columns of a data.frame baffles me.

EDIT: A toy data set is shown above.

Upvotes: 1

Views: 1605

Answers (2)

John

Reputation: 23758

tapply works on one vector at a time, but the formula interface of aggregate accepts several columns at once.

chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
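As a quick check, here is a self-contained sketch of the formula interface applied to the toy data from the question. The data frame is re-typed by hand with only the columns needed, which is an assumption made for illustration; the real my_data has more columns and more chr values.

```r
# Toy data re-typed from the question (only the columns used below)
my_data <- data.frame(
  chr = c(1, 1, 2, 2, 3, 3),
  S1  = c(328, 480, 653, 573, 2256, 526),
  S2  = c(336, 589, 673, 623, 2333, 642),
  A1  = c(757, 793, 875, 483, 1866, 650),
  A2  = c(874, 765, 920, 430, 1944, 586)
)

# cbind() on the left-hand side names the columns to summarise;
# the right-hand side names the grouping variable
chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
print(chr.sums)
```

The result should have one row per chr value and one column per summed variable, plus the chr column itself.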

Upvotes: 1

Ramnath

Reputation: 55685

You can use ddply from plyr. Here is some code:

library(plyr)  # attach plyr so that .() and summarize are found
ddply(my_data, .(chr), summarize, S1 = sum(S1), S2 = sum(S2),
  A1 = sum(A1), A2 = sum(A2))

EDIT. A more compact solution would be:

ddply(my_data, .(chr), colwise(sum, .(S1, S2, A1, A2)))  # with plyr attached via library(plyr)

Here is how it works. The data is first split into pieces based on chr. Then, the columns S1, S2, A1, A2 are summed up for each piece. Finally, they are assembled back into a single data frame.
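To make the three steps concrete, here is a rough base R sketch of the same split, apply and combine sequence. It is not how plyr is implemented internally, only an illustration of the idea; the toy data frame is re-typed from the question, and the names cols, pieces and sums are made up for this example.

```r
# Toy data re-typed from the question (only the columns used below)
my_data <- data.frame(
  chr = c(1, 1, 2, 2, 3, 3),
  S1  = c(328, 480, 653, 573, 2256, 526),
  S2  = c(336, 589, 673, 623, 2333, 642),
  A1  = c(757, 793, 875, 483, 1866, 650),
  A2  = c(874, 765, 920, 430, 1944, 586)
)

cols <- c("S1", "S2", "A1", "A2")

# Split: one small data frame per chr value
pieces <- split(my_data[cols], my_data$chr)

# Apply: sum every column within each piece
sums <- lapply(pieces, colSums)

# Combine: stack the per-piece sums and put the group label back as a column
chr.sums <- data.frame(chr = names(pieces), do.call(rbind, sums),
                       row.names = NULL)
print(chr.sums)
```

Note that chr comes back as a character column here because it is taken from the names of the split list; convert it if a numeric column is needed.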

Whenever you have this kind of split-apply-combine problem, consider plyr as a solution.

Upvotes: 4
