Using apply() to manipulate multiple columns

Question

I've got a cross-tab frequency table where the measure is CAG and the columns A01, A02, etc. are frequency counts, e.g. 6485 counts of CAG 13 and 35 counts of CAG 14.

I'm trying to:

Set to 0 all values in A01, A02, etc. that are less than 0.2 times the highest value in that column (i.e. exclude those not meeting a 20% threshold).
Normalise each value in A01, A02, etc. by dividing it by the sum of all values in that column. This will give a value between 0 and 1 for each row in the column.
Multiply each value in A01, A02 etc by the change in CAG. Change in CAG is the value in the CAG column minus the modal CAG value.
I then need to sum all the values in each column.

I've had a stab at it here, but unfortunately I'm not sure how to progress further. Would appreciate any help!

data <- data.frame(CAG = c(13, 14, 15, 17), 
               A01 = c(6485,35,132, 12), 
               A02 = c(0,42,56, 4))
iithreshold <- 0.2

ii <- lapply(data[, 8:ncol(height)], function(x) {
  mod <- data$CAG[which.max(x)]
  x < (iithreshold * max(x)) <- 0
  ii2 <- (x / sum (x)) * (height$CAG - mod)
})

ii3 <- sum(ii2)

ii3 <- as.data.frame(ii3)
ii3 <- t(ii3)

Great news! I've now got it working and giving the right results. Thanks so much! I must have had a typo somewhere. I just restarted from scratch. This is the working code:

library(data.table) 
dataDT <- data.frame(height[,7:ncol(height)])
dataDT <- setDT(dataDT)
iithreshold <- 0.2

colsToBeUsed<-names(dataDT[,!'CAG'])
sumDataSetdata<-c()
iiht<-unlist(lapply(X=1:length(colsToBeUsed),function(X){s=colsToBeUsed[X]
eval(parse(text=paste0('dataDT[',s,'



I've been going through your code trying to understand what each line does, but I'm still not sure. For my education, would you be able to explain what each one is doing?
Thank you again!

NpT · Accepted Answer

Hi, I would not go with base R for data manipulation, although it is possible. I would use either the data.table or the dplyr package for that.

I should note that this is not the only way to go; the overhead of data.table has to be taken into account when deciding between the two aforementioned packages.

Since you have n columns, I think using .SD along with .SDcols is what you need in data.table terms. For example, let's say you have columns A01 to A0n. Then you can have:

colsToBeUsed=names(data[,!('CAG')])  

data[ , lapply(.SD, {your formula as a function}), .SDcols=c(colsToBeUsed)]
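As a concrete illustration of the .SD / .SDcols pattern, here is a minimal sketch applied to the example data from the question. The helper name `instabilityIndex` is hypothetical (it is not part of data.table). The sketch assumes that thresholding happens before normalising, and that the "modal CAG" is the CAG value with the highest count in each column, as in the question's own attempt.

```r
library(data.table)

dataDT <- data.table(CAG = c(13, 14, 15, 17),
                     A01 = c(6485, 35, 132, 12),
                     A02 = c(0, 42, 56, 4))
thres <- 0.2

# Hypothetical helper: x is one count column, cag is the CAG column.
# Note: a column consisting only of zeros would give NaN (division by zero).
instabilityIndex <- function(x, cag, thres) {
  mod <- cag[which.max(x)]             # modal CAG: the CAG with the highest count
  x[x < thres * max(x)] <- 0           # zero out counts below the threshold
  sum((x / sum(x)) * (cag - mod))      # normalise, weight by change in CAG, sum
}

colsToBeUsed <- setdiff(names(dataDT), "CAG")

# .SD holds only the columns named in .SDcols; CAG is still visible by name in j
result <- dataDT[, lapply(.SD, instabilityIndex, cag = CAG, thres = thres),
                 .SDcols = colsToBeUsed]
print(result)
```

This returns a one-row data.table with one value per count column. For the example data, A01 gives 0 (only the modal row survives the threshold) and A02 gives -3/7, roughly -0.429.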

In any case, in base R, lapply is generally faster than a for loop, which is why I would recommend using lapply.

After getting a comment asking how to code this, I provide two options. First, with a for loop:
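For comparison, a base R version can be sketched with sapply (a thin wrapper around lapply that simplifies the result to a named vector). This is only one possible way to write it, using the variable names from the question and assuming the same order of steps: threshold first, then normalise.

```r
data <- data.frame(CAG = c(13, 14, 15, 17),
                   A01 = c(6485, 35, 132, 12),
                   A02 = c(0, 42, 56, 4))
iithreshold <- 0.2

# Every column except CAG is treated as a count column
countCols <- setdiff(names(data), "CAG")

ii <- sapply(data[countCols], function(x) {
  mod <- data$CAG[which.max(x)]          # modal CAG for this column
  x[x < iithreshold * max(x)] <- 0       # logical indexing, then assign 0
  sum((x / sum(x)) * (data$CAG - mod))   # normalise, weight, and sum
})
print(ii)
```

The key difference from the attempt in the question is the thresholding line: the comparison goes inside the square brackets, so the assignment targets the selected elements of x.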

library(data.table)
dataDT<- data.frame(CAG = c(13, 14, 15, 17), 
                   A01 = c(6485,35,132, 12), 
                   A02 = c(0,42,56, 4))
thres <- 0.2
dataDT<-setDT(dataDT)
colsToBeUsed<-names(dataDT[,!'CAG'])
sumDataSetdata<-c()  


for(X in colsToBeUsed){
  eval(parse(text=paste0("dataDT[",X,"



Second with lapply:

library(data.table)
dataDT<- data.frame(CAG = c(13, 14, 15, 17), 
                    A01 = c(6485,35,132, 12), 
                    A02 = c(0,42,56, 4))

thres <- 0.2
dataDT<-setDT(dataDT)

colsToBeUsed<-names(dataDT[,!'CAG'])
sumDataSetdata<-c()
sumDataSet<-unlist(lapply(X=1:length(colsToBeUsed),function(X){s=colsToBeUsed[X]
  eval(parse(text=paste0('dataDT[',s,'
