crysis405

Reputation: 1131

Optimizing a for loop in a data.table

I'm using the data.table solution found here: Duplicate entry pooling while averaging values in neighbouring columns

dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse=",")), 
          by=c("ID2", "chrom", "strand", "txStart", "txEnd")]

dt.out <- dt.out[ ,list(ID=paste(ID, collapse=","), ID2=paste(ID2, collapse=","), 
                       txStart=min(txStart), txEnd=max(txEnd)), 
                       by=c("probe", "chrom", "strand", "newCol")]

Data set:

ID      ID2         probe       chrom   strand txStart  txEnd  newCol
Rest_3  uc001aah.4  8044649     chr1    0      14361    29370  1.02
Rest_4  uc001aah.4  7911309     chr1    0      14361    29370  1.30  
Rest_5  uc001aah.4  8171066     chr1    0      14361    29370  2.80         
Rest_6  uc001aah.4  8159790     chr1    0      14361    29370  4.12 

Rest_17 uc001abw.1  7896761     chr1    0      861120   879961 1.11
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961 3.12

I added this for loop so that newCol holds the average of the collapsed values that end up in a single cell (from the first dt.out). However, the loop takes ages to run. Is there a quicker way of doing this?

for(i in 1:NROW(dt.out)){
  con <- textConnection(dt.out[i,grep("newCol", colnames(dt.out))])
  data <- read.csv(con, sep=",", header=FALSE)
  close(con)
  dt.out[i,grep("newCol", colnames(dt.out))]<- as.numeric(rowMeans(data)) 

}

Upvotes: 1

Views: 231

Answers (1)

Arun

Reputation: 118779

newCol seems to be an additional column compared to the data in the other question. I guess that, after obtaining the first dt.out, you want to take the mean of the collapsed values in newCol?

You can do that by overwriting newCol directly: split each collapsed string with strsplit() and average the pieces with sapply(). After obtaining the first dt.out, do this:

dt.out[ , newCol := sapply(strsplit(newCol, ","), function(x) mean(as.numeric(x)))]
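To see the one-liner at work, here is a minimal sketch on a cut-down version of the sample data. It assumes only the ID, ID2 and newCol columns and groups by ID2 alone, which is a simplification of the question's grouping, not the original code. It uses vapply() in place of sapply() purely to guarantee a numeric result. The name `direct` is introduced here for illustration only.

```r
library(data.table)

# Toy rows taken from the question's sample data (subset of columns only).
dt <- data.table(
  ID     = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2    = c(rep("uc001aah.4", 4), "uc001abw.1", "uc001abx.1"),
  newCol = c(1.02, 1.30, 2.80, 4.12, 1.11, 3.12)
)

# Step 1: collapse every non-grouping column into a comma-separated string.
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse = ",")), by = "ID2"]

# Step 2: split each string and average the pieces for all rows at once,
# replacing the character column with a numeric one.
dt.out[, newCol := vapply(strsplit(newCol, ",", fixed = TRUE),
                          function(x) mean(as.numeric(x)),
                          numeric(1))]

# Alternative (an assumption about what is acceptable for your pipeline):
# average newCol during the grouping itself, so it is never turned into text.
direct <- dt[, list(ID = paste(ID, collapse = ","), newCol = mean(newCol)),
             by = "ID2"]

print(dt.out)
print(direct)
```

Both tables should hold the same per-group means. If newCol never needs to exist as a collapsed string, the second form avoids the paste/strsplit round trip altogether.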

Upvotes: 2
