Reputation: 1141
I've got a cross-tab frequency table where the measure is CAG and the columns A01, A02, etc. are frequency counts, e.g. in A01 there are 6485 counts of CAG 13 and 35 counts of CAG 14.
I'm trying to:
I've had a stab at it here, but unfortunately I'm not sure how to progress further. Would appreciate any help!
data <- data.frame(CAG = c(13, 14, 15, 17),
                   A01 = c(6485, 35, 132, 12),
                   A02 = c(0, 42, 56, 4))
iithreshold <- 0.2
ii <- lapply(data[, 8:ncol(height)], function(x) {
  mod <- data$CAG[which.max(x)]
  x < (iithreshold * max(x)) <- 0
  ii2 <- (x / sum (x)) * (height$CAG - mod)
})
ii3 <- sum(ii2)
ii3 <- as.data.frame(ii3)
ii3 <- t(ii3)
Great news! I've now got it working and giving the right results. Thanks so much! I must have had a typo somewhere. I just restarted from scratch. This is the working code:
library(data.table)
dataDT <- data.frame(height[, 7:ncol(height)])
dataDT <- setDT(dataDT)
iithreshold <- 0.2
colsToBeUsed <- names(dataDT[, !'CAG'])
sumDataSetdata <- c()
iiht <- unlist(lapply(X = 1:length(colsToBeUsed), function(X) {
  s = colsToBeUsed[X]
  eval(parse(text = paste0('dataDT[', s, '<iithreshold*max(', s, '),', s, ':=0]')))
  eval(parse(text = paste0('dataDT[,MAX', s, ':=dataDT[', s, '==max(', s, '),CAG]]')))
  eval(parse(text = paste0('dataDT[,norm', s, ':=', s, '/sum(', s, ')]')))
  eval(parse(text = paste0('dataDT[,sum', s, ':=', s, '/sum(', s, ')*(CAG-MAX', s, '),]')))
  eval(parse(text = paste0('rbind(sumDataSetdata,dataDT[,sum(sum', s, ')])')))
}))
I've been going through your code trying to understand what each line does, but I'm still not sure. For my education, would you be able to explain what each one is doing? Thank you again!
Upvotes: 0
Views: 315
Reputation: 451
Hi, I would not use base R for this kind of data manipulation, although it is possible. I would use either the data.table or the dplyr package instead.
Note that this is not the only way to go: weigh the overhead of data.table before deciding between the two packages.
Since you have an arbitrary number of columns, I think .SD along with .SDcols is what you need in data.table terms.
For example, let's say you have columns A01 to A0n. Then you can write:
# assumes data is already a data.table
colsToBeUsed <- names(data[, !'CAG'])
data[, lapply(.SD, {your formula as a function}), .SDcols = colsToBeUsed]
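To make the .SD / .SDcols pattern concrete, here is a minimal sketch on the sample data from the question. The helper name weightedShift is hypothetical (my own, not from any package), and the formula is my reading of the steps in the question: zero out counts below the threshold, find the CAG at the peak, then sum the normalised counts weighted by distance from that peak.

```r
library(data.table)

dataDT <- data.table(CAG = c(13, 14, 15, 17),
                     A01 = c(6485, 35, 132, 12),
                     A02 = c(0, 42, 56, 4))
thres <- 0.2
colsToBeUsed <- setdiff(names(dataDT), "CAG")

# Hypothetical helper: x is one count column, cag is the CAG column
weightedShift <- function(x, cag, threshold) {
  x[x < threshold * max(x)] <- 0    # drop counts below threshold * peak count
  peak <- cag[which.max(x)]         # CAG value at the highest count
  sum(x / sum(x) * (cag - peak))    # normalised counts weighted by distance from peak
}

# .SD holds only the columns named in .SDcols; CAG is still visible inside j
result <- dataDT[, lapply(.SD, weightedShift, cag = CAG, threshold = thres),
                 .SDcols = colsToBeUsed]
print(result)
```

For the sample data this should give 0 for A01 (only the peak survives the threshold) and about -0.43 for A02. The original table is left unchanged because the helper works on a copy of each column.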
In any case, if you stay in base R, I would recommend lapply over an explicit for loop: it is more concise and typically no slower.
After a comment asking how to code this, here are two options. First, with a for loop:
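For comparison, a base R version might look like the sketch below. It uses vapply (a type-checked relative of lapply) so the result comes back as a named numeric vector. The helper name indexForColumn and the variable countCols are hypothetical, and the formula is again my reading of the question rather than a confirmed specification.

```r
data <- data.frame(CAG = c(13, 14, 15, 17),
                   A01 = c(6485, 35, 132, 12),
                   A02 = c(0, 42, 56, 4))
iithreshold <- 0.2

# Hypothetical helper: computes the index for a single count column
indexForColumn <- function(x, cag, threshold) {
  x[x < threshold * max(x)] <- 0    # zero out counts below the threshold
  mod <- cag[which.max(x)]          # modal CAG: the CAG with the highest count
  sum(x / sum(x) * (cag - mod))     # weighted sum of distances from the mode
}

# Every column except CAG is treated as a count column
countCols <- setdiff(names(data), "CAG")
ii <- vapply(data[countCols], indexForColumn, numeric(1),
             cag = data$CAG, threshold = iithreshold)
print(ii)
```

Selecting the count columns by name avoids hard-coded positions such as 8:ncol(...), which break as soon as the column layout changes.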
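library(data.table)
dataDT <- data.frame(CAG = c(13, 14, 15, 17),
                     A01 = c(6485, 35, 132, 12),
                     A02 = c(0, 42, 56, 4))
thres <- 0.2
dataDT <- setDT(dataDT)
colsToBeUsed <- names(dataDT[, !'CAG'])
sumDataSetdata <- c()
for (X in colsToBeUsed) {
  # zero out counts below thres * the column maximum
  eval(parse(text = paste0("dataDT[", X, "<thres*max(", X, "),", X, ":=0]")))
  # MAX<col>: the CAG value at the column's peak count
  eval(parse(text = paste0("dataDT[,MAX", X, ":=dataDT[", X, "==max(", X, "),CAG]]")))
  # norm<col>: counts as a proportion of the column total
  eval(parse(text = paste0("dataDT[,norm", X, ":=", X, "/sum(", X, ")]")))
  # sum<col>: proportion weighted by distance from the peak CAG
  eval(parse(text = paste0("dataDT[,sum", X, ":=", X, "/sum(", X, ")*(CAG-MAX", X, "),]")))
  # append this column's total to the results
  eval(parse(text = paste0("sumDataSetdata<-rbind(sumDataSetdata,dataDT[,sum(sum", X, ")])")))
}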
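To see what each eval(parse(text = paste0(...))) line does, it can help to print the string that paste0 builds before it is evaluated. The short sketch below does this for X = "A01" using two of the five lines from the loop; the variable names cmd1, cmd4 and expr1 are mine, for illustration only.

```r
X <- "A01"

# Step 1 of the loop: the thresholding command, as text
cmd1 <- paste0("dataDT[", X, "<thres*max(", X, "),", X, ":=0]")
print(cmd1)
# [1] "dataDT[A01<thres*max(A01),A01:=0]"

# Step 4 of the loop: the weighted-distance column, as text
cmd4 <- paste0("dataDT[,sum", X, ":=", X, "/sum(", X, ")*(CAG-MAX", X, "),]")
print(cmd4)
# [1] "dataDT[,sumA01:=A01/sum(A01)*(CAG-MAXA01),]"

# parse() turns the text into an unevaluated R expression;
# eval() on it would then run the command against dataDT
expr1 <- parse(text = cmd1)
print(is.expression(expr1))
```

So each line is ordinary data.table code with the column name substituted in as text, which is why the same five lines work for any number of count columns.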
Second, with lapply:
library(data.table)
dataDT <- data.frame(CAG = c(13, 14, 15, 17),
                     A01 = c(6485, 35, 132, 12),
                     A02 = c(0, 42, 56, 4))
thres <- 0.2
dataDT <- setDT(dataDT)
colsToBeUsed <- names(dataDT[, !'CAG'])
sumDataSetdata <- c()
sumDataSet <- unlist(lapply(X = 1:length(colsToBeUsed), function(X) {
  s = colsToBeUsed[X]
  # same four steps as in the for loop: threshold, peak CAG, normalise, weight
  eval(parse(text = paste0('dataDT[', s, '<thres*max(', s, '),', s, ':=0]')))
  eval(parse(text = paste0('dataDT[,MAX', s, ':=dataDT[', s, '==max(', s, '),CAG]]')))
  eval(parse(text = paste0('dataDT[,norm', s, ':=', s, '/sum(', s, ')]')))
  eval(parse(text = paste0('dataDT[,sum', s, ':=', s, '/sum(', s, ')*(CAG-MAX', s, '),]')))
  # the last expression is the function's return value: this column's total
  eval(parse(text = paste0('rbind(sumDataSetdata,dataDT[,sum(sum', s, ')])')))
}))
Upvotes: 1