user1357015

Reputation: 11696

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:

[[1]]
     cluster_size start end number       p_value
13             2    12  13    131 4.209645e-233
12             1    12  12    100 6.166824e-185
22            11    12  22    132 6.916323e-143
23            12    12  23    133 1.176194e-139
13             1    13  13     31  3.464284e-38
13            68    13 117     34  3.275941e-37
23            78    23 117      2  4.503111e-32

....

[[2]]
      cluster_size start end number       p_value
13             2    12  13    131 4.209645e-233
12             1    12  12    100 6.166824e-185
22            11    12  22    132 6.916323e-143
23            12    12  23    133 1.176194e-139
13             1    13  13     31  3.464284e-38

....

While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row across the different list elements, so I can't just do a simple sum.

The brute force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table, and 3) pull the correct p-value out of each table using a which() statement. Is there a more clever way of doing this? Or is this pretty much it?
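To be concrete, a rough sketch of that brute-force approach (assuming the list is called X and each (cluster_size, start, end, number) combination appears exactly once per table):

# Key columns from the first table, with a running total for the p-values
result <- as.data.frame(X[[1]][, c("cluster_size", "start", "end", "number")])
result$p_value <- 0
for (tab in X) {
  for (i in seq_len(nrow(result))) {
    # which() row of this table matches the key columns of row i
    j <- which(tab[, "cluster_size"] == result$cluster_size[i] &
               tab[, "start"] == result$start[i] &
               tab[, "end"] == result$end[i] &
               tab[, "number"] == result$number[i])
    result$p_value[i] <- result$p_value[i] + tab[j, "p_value"]
  }
}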

Edit: I was asked for a dput file of the data. It's located here: http://alrig.com/code/

In the sample case, the order of the rows happens to match. That will not always be the case.

Upvotes: 1

Views: 95

Answers (1)

Chase

Reputation: 69221

Seems like you can do this in two steps:

  1. Convert your list to a data.frame
  2. Use any of the split-apply-combine approaches to summarize.

Assuming your list is named X, here's what you could do:

library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
   cluster_size start end number          sump
1             1    12  12    100 5.550142e-184
2             1    13  13     31  3.117856e-37
3             1    22  22      1  9.000000e+00
...
29          105    23 117      2  6.271469e-16
30          106    22 146     13  7.266746e-25
31          107    23 146     12  1.382328e-25

Lots of other aggregation techniques are covered in other questions on this site. I'd look at the data.table package if your data is large.
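A rough data.table sketch of the same grouped sum (assuming the same list X as above):

library(data.table)
# Stack all the matrices, then sum p_value within each group of key columns
XDT <- as.data.table(do.call("rbind", X))
XDT[, list(sump = sum(p_value)), by = list(cluster_size, start, end, number)]

It gives the same grouped sums as the ddply() call above, and scales better when the data gets large.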

Upvotes: 3
