Use index of a list of data.frames to apply a function in certain elements of a data frame

Question

I have a data.frame that looks like this:

>df

  A B   C  P1  P2  P3  P4  P5  P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 1 0.2 0.1 0.4 0.2 0.1 0.2 0.2
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
4 d 2 0.1 0.1 0.7 0.5 0.1 0.7 0.1
5 e 2 0.5 0.7 0.5 0.1 0.7 0.1 0.5
6 f 2 0.7 0.5 0.5 0.7 0.1 0.7 0.1
7 g 3 0.1 0.1 0.1 0.2 0.2 0.2 0.5
8 h 3 0.2 0.2 0.1 0.5 0.2 0.2 0.5
9 i 3 0.5 0.1 0.2 0.1 0.1 0.5 0.2

And a list of data.frames similar to this one:

list.1 <- list(data.frame(AA=c("a","b","c","d")), 
             data.frame(BB=c("e","f")), 
             data.frame(CC=c("a","b","i")), 
             data.frame(DD=c("d","e","f","g")))

Besides, I have this function:

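# Fisher's method: combine a vector of p-values into a single p-value
Fisher.test <- function(p) {
  Xsq <- -2*sum(log(p), na.rm=TRUE)          # test statistic
  p.val <- 1-pchisq(Xsq, df = 2*length(p))   # upper tail, 2 df per p-value
  return(p.val)
}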
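
As a quick sanity check of what this function computes, here is a minimal sketch that applies it to the P1 values of rows a, b, c and d from the table above (the vector `p.example` is a name made up for this illustration). The statistic is -2 times the sum of the log p-values, compared against a chi-squared distribution with 2 degrees of freedom per p-value.

```r
# Same function as in the question
Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  p.val <- 1 - pchisq(Xsq, df = 2 * length(p))
  return(p.val)
}

# P1 values for rows a, b, c, d of df (hypothetical helper vector)
p.example <- c(0.1, 0.1, 0.4, 0.1)

# Combined p-value, roughly 0.0477
combined <- Fisher.test(p.example)
print(combined)
```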

I would like to select the rows of df whose df$A values appear in each data.frame of the list, and compute Fisher.test on P1...P6 for each selection. My current approach is to merge df with each element of list.1 and then apply Fisher.test to each merged data.frame:

func <- function(x,y){merge(x,y, by.x=names(x)[1], by.y=names(y)[1])}

ll <- lapply(list.1, func, df)

ll.fis <- lapply(ll, FUN=function(i){apply(i[,4:9],2,Fisher.test)})

This works, but my real data is huge. I think a better approach would use the elements of list.1[[1]] as an index into df, calculate Fisher.test and store the result, then do the same with list.1[[2]], and so on. That would avoid the merge, because all the calculations are made directly on df, and it should also reduce RAM usage. However, I have no clue how to achieve this. Perhaps a for loop?

Thanks

Ricardo Saporta · Accepted Answer

Leveraging data.table is helpful here: you can easily subset your data with the .( ) syntax, and it is extremely fast, especially on large data, compared to working with, say, subset.

library(data.table)

# convert to data.table, setting the key to the column `A`
DT <- data.table(df, key="A")

p.col.names <- paste0("P", 1:6)
results <- lapply(list.1, function(ll)
        DT[.(ll)][, lapply(.SD, Fisher.test), .SDcols=p.col.names] )

results

side note

You might want to fix the names of list.1 so that the results from lapply are properly named:

# fix the names, helpful for the lapply
names(list.1) <- lapply(list.1, names)

results:

$AA
           P1        P2        P3       P4        P5      P6
1: 0.04770305 0.1624142 0.2899578 0.029753 0.1070376 0.17549

$BB
          P1        P2        P3        P4        P5        P6
1: 0.7174377 0.5965736 0.2561482 0.2561482 0.2561482 0.1997866

$CC
          P1       P2       P3         P4        P5        P6
1: 0.0317663 0.139877 0.139877 0.05305057 0.1620897 0.2189595

$DD
         P1        P2        P3        P4        P5        P6
1: 0.184746 0.4246214 0.2704228 0.1070376 0.3215871 0.1519672
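
For comparison, the same index-based idea can be sketched in base R without data.table. This is only an illustrative alternative, not part of the answer above: it assumes the first column of each list element holds the ids to look up in df$A, and the names `p.cols` and `base.results` are made up here. Ids that do not occur in df$A are silently ignored by `%in%`.

```r
# Toy data copied from the question
df <- data.frame(
  A  = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  P1 = c(0.1, 0.1, 0.4, 0.1, 0.7, 0.5, 0.1, 0.2, 0.1),
  P2 = c(0.1, 0.4, 0.1, 0.7, 0.5, 0.5, 0.1, 0.1, 0.2),
  P3 = c(0.4, 0.2, 0.2, 0.5, 0.1, 0.7, 0.2, 0.5, 0.1),
  P4 = c(0.2, 0.1, 0.1, 0.1, 0.7, 0.1, 0.2, 0.2, 0.1),
  P5 = c(0.1, 0.2, 0.1, 0.7, 0.1, 0.7, 0.2, 0.2, 0.5),
  P6 = c(0.4, 0.2, 0.4, 0.1, 0.5, 0.1, 0.5, 0.5, 0.2),
  stringsAsFactors = FALSE
)

list.1 <- list(data.frame(AA = c("a", "b", "c", "d")),
               data.frame(BB = c("e", "f")),
               data.frame(CC = c("a", "b", "i")),
               data.frame(DD = c("d", "e", "f", "g")))

Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  1 - pchisq(Xsq, df = 2 * length(p))
}

p.cols <- paste0("P", 1:6)   # hypothetical name for the P columns

# For each list element, build a logical row index into df (no merge),
# then apply Fisher.test to each P column of the selected rows.
base.results <- lapply(list.1, function(x) {
  rows <- df$A %in% x[[1]]
  vapply(df[rows, p.cols], Fisher.test, numeric(1))
})

# Name the results after the single column of each list element
names(base.results) <- vapply(list.1, names, character(1))

print(base.results)
```

Each element of `base.results` is a named numeric vector with one combined p-value per P column. Only the matching rows of the six P columns are copied for each list element, so no merged data.frames are kept in memory.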
