I have a data.frame that looks like this:
> df
  A B   C  P1  P2  P3  P4  P5  P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 1 0.2 0.1 0.4 0.2 0.1 0.2 0.2
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
4 d 2 0.1 0.1 0.7 0.5 0.1 0.7 0.1
5 e 2 0.5 0.7 0.5 0.1 0.7 0.1 0.5
6 f 2 0.7 0.5 0.5 0.7 0.1 0.7 0.1
7 g 3 0.1 0.1 0.1 0.2 0.2 0.2 0.5
8 h 3 0.2 0.2 0.1 0.5 0.2 0.2 0.5
9 i 3 0.5 0.1 0.2 0.1 0.1 0.5 0.2
And a list of data.frames similar to this one:
list.1 <- list(data.frame(AA=c("a","b","c","d")),
               data.frame(BB=c("e","f")),
               data.frame(CC=c("a","b","i")),
               data.frame(DD=c("d","e","f","g")))
Besides, I have this function:
Fisher.test <- function(p) {
  Xsq <- -2*sum(log(p), na.rm=TRUE)
  p.val <- 1-pchisq(Xsq, df = 2*length(p))
  return(p.val)
}
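One detail worth checking in this function: `na.rm` drops missing p-values from the sum, but `length(p)` still counts them, so the degrees of freedom come out too large whenever `p` contains `NA`. Below is a minimal sketch of a variant that counts only the p-values actually used. The name `fisher_combined` is made up for illustration, and the sketch assumes that silently dropping `NA`s is the intended behaviour.

```r
# Hypothetical variant of Fisher.test; the name fisher_combined is not from the post.
fisher_combined <- function(p) {
  p <- p[!is.na(p)]                      # drop missing p-values before counting
  if (length(p) == 0) return(NA_real_)   # nothing left to combine
  Xsq <- -2 * sum(log(p))                # Fisher's statistic
  # Upper tail of a chi-squared distribution with 2k degrees of freedom,
  # where k is the number of non-missing p-values
  pchisq(Xsq, df = 2 * length(p), lower.tail = FALSE)
}

fisher_combined(c(0.1, 0.1, 0.4, 0.1))      # P1 for rows a, b, c, d
fisher_combined(c(0.1, NA, 0.1, 0.4, 0.1))  # same value: the NA is ignored
```

Without missing values this gives the same result as `Fisher.test`; `lower.tail = FALSE` is used instead of `1 - pchisq(...)` only because it keeps more precision for very small p-values.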
I would like to select the rows of df whose df$A values appear in each data.frame of the list, and compute Fisher.test on P1...P6 for each selection. So far I have been merging df with each element of list.1 and then applying Fisher.test to each merged data.frame in the list:
func <- function(x,y){merge(x,y, by.x=names(x)[1], by.y=names(y)[1])}
ll <- lapply(list.1, func, df)
ll.fis <- lapply(ll, FUN=function(i){apply(i[,4:9],2,Fisher.test)})
This works, but my real data is huge, so I think a different approach would be better: use the elements of list.1[[1]] to index df directly, calculate Fisher.test and store the result, then do the same with list.1[[2]], and so on. That way the merge is avoided because all the calculations are made on df itself, and RAM usage is kept down as well. However, I have no clue how to achieve this. Perhaps a for loop?
Thanks
data.table is helpful here, since you can easily subset your data by key using the .( ) syntax, and it is extremely fast, especially on large data, compared to working with, say, subset:
library(data.table)
# convert to data.table, setting the key to the column `A`
DT <- data.table(df, key="A")
p.col.names <- paste0("P", 1:6)
results <- lapply(list.1, function(ll)
  DT[.(ll)][, lapply(.SD, Fisher.test), .SDcols=p.col.names] )
results
You might want to fix the names of list.1 (before running the lapply above) so that the results from lapply are properly named:
# fix the names, helpful for the lapply
names(list.1) <- sapply(list.1, names)
With the names set, results prints as:
$AA
P1 P2 P3 P4 P5 P6
1: 0.04770305 0.1624142 0.2899578 0.029753 0.1070376 0.17549
$BB
P1 P2 P3 P4 P5 P6
1: 0.7174377 0.5965736 0.2561482 0.2561482 0.2561482 0.1997866
$CC
P1 P2 P3 P4 P5 P6
1: 0.0317663 0.139877 0.139877 0.05305057 0.1620897 0.2189595
$DD
P1 P2 P3 P4 P5 P6
1: 0.184746 0.4246214 0.2704228 0.1070376 0.3215871 0.1519672
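The index-based idea from the question (no merge, every calculation done directly on df) can also be sketched in base R. This is only a sketch under stated assumptions: it assumes the values in df$A are unique, since `match` returns the first hit only, and it silently drops ids that are not found in df, which mirrors the inner join done by `merge`. The names `P`, `idx` and `res` are introduced here for illustration.

```r
# Fisher's method as defined in the question
Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  1 - pchisq(Xsq, df = 2 * length(p))
}

# Example data from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "A B C P1 P2 P3 P4 P5 P6
a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
b 1 0.2 0.1 0.4 0.2 0.1 0.2 0.2
c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
d 2 0.1 0.1 0.7 0.5 0.1 0.7 0.1
e 2 0.5 0.7 0.5 0.1 0.7 0.1 0.5
f 2 0.7 0.5 0.5 0.7 0.1 0.7 0.1
g 3 0.1 0.1 0.1 0.2 0.2 0.2 0.5
h 3 0.2 0.2 0.1 0.5 0.2 0.2 0.5
i 3 0.5 0.1 0.2 0.1 0.1 0.5 0.2")

list.1 <- list(data.frame(AA = c("a", "b", "c", "d")),
               data.frame(BB = c("e", "f")),
               data.frame(CC = c("a", "b", "i")),
               data.frame(DD = c("d", "e", "f", "g")))

# Pull the P columns out once as a numeric matrix (P is an illustrative name)
P <- as.matrix(df[, paste0("P", 1:6)])

res <- lapply(list.1, function(x) {
  # Row positions in df of the ids in this list element (assumes unique df$A)
  idx <- match(as.character(x[[1]]), df$A)
  idx <- idx[!is.na(idx)]                        # ignore ids that are not in df
  apply(P[idx, , drop = FALSE], 2, Fisher.test)  # one combined p-value per column
})
names(res) <- vapply(list.1, names, character(1))
res
```

Each element of `res` is a named numeric vector with one combined p-value per P column, so no merged copy of df is built for any list element; only the selected rows of the matrix are touched.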