Reputation: 1111
I have 10 data frames with 2 columns each; I'm calling the data frames a, b, c, d, e, f, g, h, i and j.
The first column in each data frame is called s (for sequences) and the second is p (for the p-value corresponding to each sequence). The s column contains the same sequences across all 10 data frames, so essentially the only difference is in the p-values. Below is a short version of data frame a, which has 600,000 rows.
s p
gtcg 0.06
gtcgg 0.05
gggaa 0.07
cttg 0.05
I want to rank each data frame by p-value: the smallest p-value should get a rank of 1, and equal p-values should get the same rank. Each final data frame should be in this format:
s p_rank_a
gtcg 2
gtcgg 1
gggaa 3
cttg 1
I've used this to do one:
r<-rank(a$p)
cbind(a$s,r)
But I'm not very familiar with loops and I don't know how to do this automatically. Ultimately I would like a final file that has the s column and, in the next column, the sum of the ranks across all data frames for each specific sequence. So basically this:
s ranksum_P_a-j
gtcg 34
gtcgg 5
gggaa 5009093
cttg 499
Please help and thanks!
Upvotes: 4
Views: 456
Reputation: 118779
I'd put all the data.frames in a list and then use lapply and transform as follows:
my_l <- list(a,b,c) # all your data.frames
# you can use rank but it'll give you the average in case of ties
# lapply(my_l, function(x) transform(x, rank_p = rank(p)))
# I prefer this method instead
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))
# now bind them in to a single data.frame
my_o <- do.call(rbind, my_o)
# now paste them
aggregate(data = my_o, p ~ s, function(x) paste(x, collapse=","))
# s p
# 1 cttg 1,1,1
# 2 gggaa 3,3,3
# 3 gtcg 2,2,2
# 4 gtcgg 1,1,1
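The question ultimately asks for the sum of the ranks per sequence rather than a pasted string. A minimal base R sketch of that last step, assuming toy stand-ins for the data frames (the p-values in `b` are made up for illustration, and the column name `ranksum` is my own choice):

```r
# Toy stand-ins for the question's data frames; the values in `b` are invented
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05), stringsAsFactors = FALSE)
b <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.01, 0.20, 0.20, 0.03), stringsAsFactors = FALSE)
my_l <- list(a, b)

# Rank within each data frame: ties share a rank and no ranks are skipped
ranked <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))

# Stack the ranked frames, then sum the ranks for each sequence
stacked <- do.call(rbind, ranked)
rank_sum <- aggregate(p ~ s, data = stacked, FUN = sum)
names(rank_sum)[2] <- "ranksum"
rank_sum
```

Because the frames are stacked before aggregating, the rows do not need to be in the same order in every data frame.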
Edit: since you've asked for a potentially faster solution (due to the large data), I'd suggest, like @Ricardo, a data.table solution:
require(data.table)
# bind all your data.frames together, tagging each row with its source
dt <- rbindlist(my_l, idcol = "id") # my_l is your list of data.frames
# replace p-values with their "rank" within each original data.frame
dt[, p := as.numeric(factor(p)), by = id]
# set key
setkey(dt, "s")
# combine them using `,`
dt[, list(p_ranks = paste(p, collapse=",")), by=s]
Upvotes: 2
Reputation: 55350
For a single data.frame, you can do it in one line, as follows
(credit to @Arun for pointing out the use of as.numeric(factor(p))):
library(data.table)
aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]
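To see why `as.numeric(factor(p))` is used here instead of plain `rank()`, here is a small base R comparison on the four p-values from the question. The variable names are just for illustration:

```r
p <- c(0.06, 0.05, 0.07, 0.05)

# Default rank(): tied values get the average of their positions
r_avg <- rank(p)                         # 3.0 1.5 4.0 1.5

# ties.method = "min": ties share the lowest position, but a rank is skipped
r_min <- rank(p, ties.method = "min")    # 3 1 4 1

# Factor codes: ties share a rank and ranks stay consecutive
r_dense <- as.numeric(factor(p))         # 2 1 3 1

# Equivalent without going through a factor
r_dense2 <- match(p, sort(unique(p)))    # 2 1 3 1
```

Only the last two match the output requested in the question (2, 1, 3, 1).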
I would suggest keeping all the data.frames in a single list, so that you can easily iterate over them. Since your data.frames are named with letters, it's easy to collect the ten of them:
# collect them all
allOfThem <- lapply(letters[1:10], get, envir=.GlobalEnv)
# keep in mind you named an object `c`
# convert to DT and create the ranks
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
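Once the frames are in a list, one way to get from per-frame ranks to the final table is to line the ranks up by `s` and add them row-wise. This is only a sketch in base R on two toy frames (the values in `b` are invented, and `frames`, `ranks` and `wide` are names I made up); it assumes every frame contains the same set of sequences, possibly in a different order:

```r
# Toy stand-ins; `b` is deliberately in a different row order
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05), stringsAsFactors = FALSE)
b <- data.frame(s = c("cttg", "gggaa", "gtcg", "gtcgg"),
                p = c(0.03, 0.20, 0.01, 0.20), stringsAsFactors = FALSE)
frames <- list(a = a, b = b)

# One column of ranks per frame, reordered to follow the first frame's s
ranks <- sapply(frames, function(x) {
  r <- as.numeric(factor(x$p))      # rank within this frame, ties share a rank
  r[match(frames[[1]]$s, x$s)]      # align rows by sequence
})
colnames(ranks) <- paste0("p_rank_", names(frames))

# Per-frame ranks plus their row-wise sum
wide <- data.frame(s = frames[[1]]$s, ranks, ranksum = rowSums(ranks),
                   stringsAsFactors = FALSE)
wide
```

With the real data, `frames` would be the named list of all ten data frames, for example built with `mget(letters[1:10])`.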
On a separate note: it might be a good habit to avoid naming objects "c" or after other common functions in R. Otherwise, you'll start encountering many "unexplainable" behaviors which, after you've beaten your head against a wall for an hour trying to debug them, turn out to come from having overwritten the name of a function. This has never happened to me :)
Upvotes: 2