CadisEtRama

Reputation: 1111

Ranking multiple data frames and summing across them in R

I have 10 data frames with 2 columns each. I'm calling the data frames a, b, c, d, e, f, g, h, i and j.

The first column in each data frame is called s (sequences) and the second is called p (the p-value for each sequence). The s column contains the same sequences in all 10 data frames; the only difference is in the p-values. Below is a short version of data frame a, which has 600,000 rows.

s       p
gtcg    0.06
gtcgg   0.05
gggaa   0.07
cttg    0.05

I want to rank each data frame by p-value: the smallest p-value should get a rank of 1, and equal p-values should get the same rank. Each final data frame should be in this format:

    s       p_rank_a
    gtcg    2
    gtcgg   1
    gggaa   3
    cttg    1

I've used this to do one:

r<-rank(a$p)

cbind(a$s,r)

but I'm not very familiar with loops and I don't know how to do this automatically for all of them. Ultimately I would like a final file that has the s column and, in the next column, the sum of the ranks across all data frames for each sequence. So basically this:

s       ranksum_P_a-j
gtcg    34
gtcgg   5
gggaa   5009093
cttg    499

Please help and thanks!

Upvotes: 4

Views: 456

Answers (2)

Arun

Reputation: 118779

I'd put all the data.frames in a list and then use lapply and transform as follows:

my_l <- list(a,b,c) # all your data.frames
# you can use rank, but by default it gives the average rank in case of ties
# lapply(my_l, function(x) transform(x, rank_p = rank(p)))

# I prefer this method instead: tied p-values share a rank and there are no gaps
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))

# now bind them into a single data.frame
my_o <- do.call(rbind, my_o)

# now paste the ranks together, one row per sequence
aggregate(p ~ s, data = my_o, FUN = function(x) paste(x, collapse = ","))

#       s     p
# 1  cttg 1,1,1
# 2 gggaa 3,3,3
# 3  gtcg 2,2,2
# 4 gtcgg 1,1,1
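If what you want in the end is the rank sum rather than the pasted ranks, the same pipeline can end in sum instead of paste. A minimal sketch on toy data: the second data frame b and its p-values are made up for illustration, and the column name ranksum is just a placeholder.

```r
# Toy stand-ins for the real data frames (b's p-values are invented)
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05))
b <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.01, 0.20, 0.20, 0.03))
my_l <- list(a, b)

# Rank within each data frame: ties share a rank, ranks have no gaps
ranked <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))

# Stack the ranked data frames and sum the ranks per sequence
ranksum <- aggregate(p ~ s, data = do.call(rbind, ranked), FUN = sum)
names(ranksum)[2] <- "ranksum"
ranksum
# sums per sequence: cttg 3, gggaa 6, gtcg 3, gtcgg 4
```

Because the ranking happens inside lapply, each data frame is ranked on its own before anything is combined.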

Edit: since you've asked for a potentially faster solution (due to the large data), I'd suggest, like @Ricardo, a data.table solution:

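library(data.table)
# bind all your data.frames together, tagging each row with its source
dt <- rbindlist(my_l, idcol = "src") # my_l is your list of data.frames

# replace the p-values with their "rank" within each data.frame
# (without `by`, the ranks would be computed over the pooled p-values)
dt[, p := as.numeric(factor(p)), by = src]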

# set key
setkey(dt, "s")

# combine them using `,`
dt[, list(p_ranks = paste(p, collapse=",")), by=s]
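For the rank sum the question asks for, the paste step can be swapped for sum. The sketch below is one way to do it, assuming a reasonably recent data.table that provides the idcol argument of rbindlist and frank with ties.method = "dense". The toy data frame b and the column names src, p_rank and ranksum are made up for illustration.

```r
library(data.table)

# Toy stand-ins for the real data frames (b's p-values are invented)
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05))
b <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.01, 0.20, 0.20, 0.03))

# Stack the data frames; `src` records which one each row came from
dt <- rbindlist(list(a = a, b = b), idcol = "src")

# Dense rank within each source: ties share a rank, ranks have no gaps
dt[, p_rank := frank(p, ties.method = "dense"), by = src]

# Sum the ranks per sequence
ranksum <- dt[, .(ranksum = sum(p_rank)), by = s]
ranksum
# sums per sequence: gtcg 3, gtcgg 4, gggaa 6, cttg 3
```

Grouping by src is what keeps the ranks separate per data frame; ranking the stacked table without it would rank the pooled p-values instead.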


Upvotes: 2

Ricardo Saporta

Reputation: 55350

For a single data.frame, you can do it in one line, as follows
(credit to @Arun for pointing out as.numeric(factor(p))):

library(data.table)
aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]

I would suggest keeping all the data.frames in a single list, so that you can easily iterate over them. Since your data.frames are named with letters, it's easy to collect the ten of them:

# collect them all
allOfThem <- lapply(letters[1:10], get, envir=.GlobalEnv)   
# keep in mind you named an object `c`

# convert to DT and create the ranks
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
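From that list, the rank sum can be built by lining the ranks up on one reference order of sequences and adding across. A rough sketch on toy data: the data frame b, its p-values and the names ref, rank_mat and result are made up for illustration, and b is deliberately in a different row order to show the alignment step.

```r
library(data.table)

# Toy stand-ins for the real data frames (b is invented and reordered)
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05))
b <- data.frame(s = c("cttg", "gtcg", "gggaa", "gtcgg"),
                p = c(0.03, 0.01, 0.20, 0.20))

# With the real data this would be the list collected via get() above
allOfThem <- list(a = a, b = b)
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])

# Use the first table's sequence order as the reference
ref <- allOfThem[[1]]$s

# One column of ranks per data frame, rows aligned to `ref` via match()
rank_mat <- sapply(allOfThem, function(x) x$p_rank[match(ref, x$s)])

# Add across the columns to get the rank sum per sequence
result <- data.table(s = ref, ranksum = rowSums(rank_mat))
result
# sums per sequence: gtcg 3, gtcgg 4, gggaa 6, cttg 3
```

The match() call only matters if the row order differs between data frames; if every data frame lists the sequences in the same order, the rank columns can be added directly.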

On a separate note: it is a good habit to avoid naming objects "c" or after other common functions in R. Otherwise, you will start running into many "unexplainable" behaviors, and only after you've beaten your head against a wall for an hour trying to debug it will you realize that you've overwritten the name of a function. This has never happened to me :)

Upvotes: 2
