Reputation: 6874
I have a list of dataframes as follows (dput is way too big even with head=1 so I've had to do a mockup here with str(df_list))
$ OC_AH_026C :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 45.183 111.038 162.785 -0.712 83.473 ...
$ OC_AH_026C.1:'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 69.6 125.1 156.4 12.8 97.4 ...
$ OC_AH_026T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 13 12.5 103.1 56.7 145.4 ...
$ OC_AH_058T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 87.114 118.963 184.31 -0.173 171.733 ...
$ OC_AH_084T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 29.111 103.142 57.476 -0.712 50.156 ...
$ OC_AH_086T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 49.8 81 111.5 47 98.8 ...
$ OC_AH_088T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 117 152 224 121 196 ...
$ OC_AH_096T :'data.frame': 13081 obs. of 3 variables:
..$ chr : num [1:13081] 1 1 1 1 1 1 1 1 1 1 ...
..$ leftPos: num [1:13081] 736092 818159 4105086 4140849 4464314 ...
..$ Means : num [1:13081] 49.5 102.8 93.6 15.2 103.2 ...
I am trying to score the third column (Means) of each dataframe after grouping it into bins with dplyr: values that are significantly elevated get a 1, values that are significantly depressed get a -1, and everything else gets a 0, stored in a new column in each dataframe.
To do the grouping I have done as follows which works fine:
CLL <- function (col) {
col <- col %>%
group_by(chr, binnum = (leftPos) %/% 500000) %>%
summarise(Means = mean(Means)) %>%
mutate(leftPos = (binnum+1) * 120000) %>%
select(leftPos, Means)}
CML<-lapply(df_list, CLL)
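To make the grouping key concrete, here is a small base-R check of what leftPos %/% 500000 does, using the first five rows shown in the str() output above. The values are the rounded ones printed by str(), so this is only an illustration of the binning, not a reproduction of the real data.

```r
# First five positions and Means of OC_AH_026C, as printed by str()
leftPos <- c(736092, 818159, 4105086, 4140849, 4464314)
means   <- c(45.183, 111.038, 162.785, -0.712, 83.473)

# Integer division assigns every position to a 500 kb bin
binnum <- leftPos %/% 500000
binnum
# 1 1 8 8 8

# Mean of Means per bin, the base-R counterpart of group_by() + summarise()
bin_means <- tapply(means, binnum, mean)
bin_means
```

Note that CLL then sets leftPos to (binnum + 1) * 120000, which differs from the 500000 bin width; the question does not say whether that is intended, so it is left as written.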
I am stuck on then calculating the upper and lower limits for each Means column in each dataframe. I think this is because I do not know how to reference this column because it is in a list of dataframes. For a non list dataframe I use:
UL = median(col2, na.rm = TRUE) + alpha*IQR(col2, na.rm = TRUE)
LL = median(col2, na.rm = TRUE) - alpha*IQR(col2, na.rm = TRUE)
I have tried to reference the third column of each dataframe as follows:
tre<-lapply(CML, "[[", 3)
but of course this extracts the third column and puts it in 'tre' whereas I want to alter the dataframes in the list so that the third column has its relationship with the other two columns maintained.
So... a) how do I reference the Means column and get the upper and lower limit for each dataframe, and then b) on the basis of whether each row in the Means column of each dataframe is > upper limit or < lower limit, assign 1, -1 or 0 in a new column?
Upvotes: 0
Views: 290
Reputation: 10152
This is what you can do, which is similar to @Roland's answer.
Say that you have data that looks like this (a simplified version of the data you showed):
df_list <- list(OC_AH_026C = data.frame(chr = 1,
leftPos= c(73, 81, 41, 44),
Means = c(111, 111, 162, -0.7)),
OC_AH_026C.1 = data.frame(chr = 1,
leftPos = c(73, 81, 41, 44),
Means = c(69, 125, 156, 12)))
You can use lapply to "loop" through the elements of the list like this. The function calculates the LL and UL of the column named in col2 (here "Means") and adds a column res that is -1, 0 or 1 depending on whether the Means value lies below LL, between the limits, or above UL:
df_list2 <- lapply(df_list, function(df, alpha, col2) {
  # [[ returns a plain vector for both data.frames and tibbles
  x <- df[[col2]]
  # perform all your calculations here
  df$LL <- median(x, na.rm = TRUE) - alpha * IQR(x, na.rm = TRUE)
  df$UL <- median(x, na.rm = TRUE) + alpha * IQR(x, na.rm = TRUE)
  # -1 if Means < LL,
  #  1 if Means > UL,
  #  0 otherwise; nest the operators
  # if you wish to calculate more complex conditions
  df$res <- 0 + ((df$Means < df$LL) * (-1)) + ((df$Means > df$UL) * 1)
  return(df)
}, alpha = 0.95, col2 = "Means")
df_list2
# $OC_AH_026C
# chr leftPos Means LL UL res
# 1 1 73 111.0 72.35875 149.6412 0
# 2 1 81 111.0 72.35875 149.6412 0
# 3 1 41 162.0 72.35875 149.6412 1
# 4 1 44 -0.7 72.35875 149.6412 -1
#
# $OC_AH_026C.1
# chr leftPos Means LL UL res
# 1 1 73 69 22.9 171.1 0
# 2 1 81 125 22.9 171.1 0
# 3 1 41 156 22.9 171.1 0
# 4 1 44 12 22.9 171.1 -1
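The limits printed above can be reproduced by hand, which helps to see where the numbers come from. The following is a minimal sketch for the first data.frame only; it assumes R's default quantile method (type 7), which is what IQR() uses unless told otherwise. Note that median +/- alpha * IQR is a robust cut-off band rather than a formal confidence interval.

```r
means <- c(111, 111, 162, -0.7)   # Means column of OC_AH_026C in the toy data
alpha <- 0.95

med <- median(means)              # 111
iqr <- IQR(means)                 # 123.75 - 83.075 = 40.675

LL <- med - alpha * iqr           # 72.35875
UL <- med + alpha * iqr           # 149.64125

# Compact equivalent of the res column: TRUE/FALSE are treated as 1/0
res <- (means > UL) - (means < LL)
res
# 0 0 1 -1
```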
(I hope I understood what you need; otherwise let me know and I will correct the answer.)
For the sake of completeness, I include a data.table way, which is faster (but gets rid of the list structure). The approach looks like this:
library(data.table)
# combine all listed data.frames into one data.table; idcol adds a column
# "name" that holds the list name of each data.frame
dt <- rbindlist(df_list, idcol = "name")
# calculate the limits
alpha <- 0.95
dt[, LL := median(Means, na.rm = TRUE) - alpha * IQR(Means, na.rm = TRUE), by = name]
dt[, UL := median(Means, na.rm = TRUE) + alpha * IQR(Means, na.rm = TRUE), by = name]
# flag each row; inside dt[...] the columns are referenced directly by name
dt[, res := 0 + ((Means < LL) * (-1)) + ((Means > UL) * 1)]
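If the per-sample list is still needed afterwards, the stacked result can be split back by name. The sketch below shows the same round trip (stack, per-group limits, split) in base R only, so it runs without extra packages; the object names stacked and back are made up for this illustration. With data.table, split(dt, by = "name") plays the role of the final step.

```r
df_list <- list(
  OC_AH_026C   = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(111, 111, 162, -0.7)),
  OC_AH_026C.1 = data.frame(chr = 1, leftPos = c(73, 81, 41, 44),
                            Means = c(69, 125, 156, 12))
)
alpha <- 0.95

# Stack all data.frames and record which list element each row came from
stacked <- do.call(rbind, Map(function(nam, df) data.frame(name = nam, df),
                              names(df_list), df_list))
rownames(stacked) <- NULL

# ave() computes a per-group statistic and repeats it for every row of the group
centre <- ave(stacked$Means, stacked$name,
              FUN = function(x) median(x, na.rm = TRUE))
spread <- ave(stacked$Means, stacked$name,
              FUN = function(x) alpha * IQR(x, na.rm = TRUE))
stacked$LL  <- centre - spread
stacked$UL  <- centre + spread
stacked$res <- (stacked$Means > stacked$UL) - (stacked$Means < stacked$LL)

# Split back into a named list, keeping the original order of df_list
back <- split(stacked, factor(stacked$name, levels = names(df_list)))
names(back)
```

Each element of back then carries the same LL, UL and res columns as the lapply version, plus the name column.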
Upvotes: 2