Sebastian Zeki

Reputation: 6874

How to perform function on a list of dataframes

I have a list of dataframes as follows (dput is way too big even with head=1 so I've had to do a mockup here with str(df_list))

$ OC_AH_026C  :'data.frame':    13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 45.183 111.038 162.785 -0.712 83.473 ...
 $ OC_AH_026C.1:'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 69.6 125.1 156.4 12.8 97.4 ...
 $ OC_AH_026T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 13 12.5 103.1 56.7 145.4 ...
 $ OC_AH_058T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 87.114 118.963 184.31 -0.173 171.733 ...
 $ OC_AH_084T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 29.111 103.142 57.476 -0.712 50.156 ...
 $ OC_AH_086T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 49.8 81 111.5 47 98.8 ...
 $ OC_AH_088T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 117 152 224 121 196 ...
 $ OC_AH_096T  :'data.frame':   13081 obs. of  3 variables:
  ..$ chr    : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
  ..$ Means  : num [1:13081] 49.5 102.8 93.6 15.2 103.2 ...

I am trying to score the third column of each dataframe (Means, grouped into bins using dplyr): values that are significantly elevated get a 1, values that are significantly depressed get a -1, and everything else gets a 0, stored in a new column of each dataframe.

To do the grouping I have done as follows which works fine:

CLL <- function (col) {
col <- col %>%
  group_by(chr, binnum = (leftPos) %/% 500000) %>%
  summarise(Means = mean(Means)) %>%
  mutate(leftPos = (binnum+1) * 120000) %>%
  select(leftPos, Means)}

CML<-lapply(df_list, CLL)

I am stuck on then calculating the upper and lower limits for each Means column in each dataframe. I think this is because I do not know how to reference this column because it is in a list of dataframes. For a non list dataframe I use:

UL = median(col2, na.rm = TRUE) + alpha*IQR(col2, na.rm = TRUE)
LL = median(col2, na.rm = TRUE) - alpha*IQR(col2, na.rm = TRUE)

I have tried to reference the third column of each dataframe as follows:

tre<-lapply(CML, "[[", 3)

but of course this extracts the third column and puts it in 'tre' whereas I want to alter the dataframes in the list so that the third column has its relationship with the other two columns maintained.

So..... a) How do I reference the Means column and get the upper and lower limit of each dataframe and then b) on the basis of whether the row in the Means column of each dataframe are >upper limit or

Upvotes: 0

Views: 290

Answers (1)

David

Reputation: 10152

This is what you can do, which is similar to @Roland's answer.

Say that you have data that looks like this (a simplified version of the data you showed):

df_list <- list(OC_AH_026C = data.frame(chr = 1, 
                                        leftPos= c(73, 81, 41, 44),
                                        Means = c(111, 111, 162, -0.7)),
                OC_AH_026C.1 = data.frame(chr = 1,
                                          leftPos = c(73, 81, 41, 44),
                                          Means = c(69, 125, 156, 12)))

You can use lapply to "loop" over the elements of the list like this. The function calculates the LL and UL of the column named in col2 (here "Means") and adds a column res that indicates whether each Means value lies below the lower limit (-1), above the upper limit (1), or between the two (0):

df_list2 <- lapply(df_list, function(df, alpha, col2) { 

  # perform all your calculations here
  df$LL <- median(df[, col2], na.rm = TRUE) - alpha*IQR(df[, col2], na.rm = TRUE)
  df$UL <- median(df[, col2], na.rm = TRUE) + alpha*IQR(df[, col2], na.rm = TRUE)

  # -1 if Means < LL, 
  # 1 if Means > UL
  # 0 otherwise, nest the operators 
  # if you wish to calculate more complex conditions
  df$res <- 0 + ((df$Means < df$LL)*(-1)) + ((df$Means > df$UL)*1)

  return(df)
}, alpha = 0.95, col2 = "Means")

df_list2
# $OC_AH_026C
# chr leftPos Means       LL       UL res
# 1   1      73 111.0 72.35875 149.6412   0
# 2   1      81 111.0 72.35875 149.6412   0
# 3   1      41 162.0 72.35875 149.6412   1
# 4   1      44  -0.7 72.35875 149.6412  -1
# 
# $OC_AH_026C.1
# chr leftPos Means   LL    UL res
# 1   1      73    69 22.9 171.1   0
# 2   1      81   125 22.9 171.1   0
# 3   1      41   156 22.9 171.1   0
# 4   1      44    12 22.9 171.1  -1

(I hope I got your question right of what you need, otherwise let me know and I will correct the answer).
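If you only need the limits themselves (part a of the question), one row per data frame, a base R sketch along these lines might be enough. Note that get_limits() is a hypothetical helper name I made up for illustration, and the toy data is the same as above:

```r
# Toy data copied from the example above
df_list <- list(
  OC_AH_026C   = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(111, 111, 162, -0.7)),
  OC_AH_026C.1 = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(69, 125, 156, 12))
)

# get_limits() is a hypothetical helper (not from any package):
# returns median -/+ alpha * IQR of one column as a named vector
get_limits <- function(df, alpha = 0.95, col2 = "Means") {
  x <- df[[col2]]
  centre <- median(x, na.rm = TRUE)
  spread <- alpha * IQR(x, na.rm = TRUE)
  c(LL = centre - spread, UL = centre + spread)
}

# vapply() gives a 2 x n matrix (rows LL and UL);
# t() turns it into one row per data frame
limits <- t(vapply(df_list, get_limits, FUN.VALUE = c(LL = 0, UL = 0)))
limits
```

The list itself is left untouched here, so this is only useful for inspecting the limits before deciding how to flag the rows.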

data.table way

For the sake of completeness, I include a data.table way, which is faster (but gets rid of the list structure). The approach looks like this:

library(data.table)
library(magrittr) # for some piping

# combine all listed data.frames to one data.table with another column, which indicates the name
dt <- lapply(1:length(df_list), function(i) {
  nam <- names(df_list)[i]
  df <- df_list[[i]]
  tmpdt <- data.table(name = nam, df)
}) %>% rbindlist

# calculate the limits
alpha <- 0.95
dt[, LL := median(Means, na.rm = TRUE) - alpha*IQR(Means, na.rm = TRUE), by = name]
dt[, UL := median(Means, na.rm = TRUE) + alpha*IQR(Means, na.rm = TRUE), by = name]

# refer to the columns of dt directly; there is no df object at this point
dt[, res := 0 + ((Means < LL)*(-1)) + ((Means > UL)*1)]
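If you want the list structure back afterwards, a sketch like the following should work, assuming a reasonably recent data.table (the idcol argument of rbindlist and the by argument of split on a data.table). The name df_list3 is just a placeholder I picked, and the toy data is the same as above:

```r
library(data.table)

# Toy data copied from the example above
df_list <- list(
  OC_AH_026C   = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(111, 111, 162, -0.7)),
  OC_AH_026C.1 = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(69, 125, 156, 12))
)

alpha <- 0.95

# idcol = "name" stores each list element's name in a new column
dt <- rbindlist(df_list, idcol = "name")

# compute both limits per original data frame in one grouped step
dt[, c("LL", "UL") := {
  centre <- median(Means, na.rm = TRUE)
  spread <- alpha * IQR(Means, na.rm = TRUE)
  list(centre - spread, centre + spread)
}, by = name]

# 1 above UL, -1 below LL, 0 otherwise
dt[, res := (Means > UL) - (Means < LL)]

# split back into a named list of data.tables, dropping the helper column
df_list3 <- split(dt, by = "name", keep.by = FALSE)
```

The elements of df_list3 are data.tables rather than plain data.frames; wrapping the last line in lapply(..., as.data.frame) would convert them back if that matters.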

Upvotes: 2
