Reputation: 1131
I'm using the data.table solution found here: Duplicate entry pooling while averaging values in neighbouring columns
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse=",")),
             by=c("ID2", "chrom", "strand", "txStart", "txEnd")]
dt.out <- dt.out[, list(ID=paste(ID, collapse=","), ID2=paste(ID2, collapse=","),
                        txStart=min(txStart), txEnd=max(txEnd)),
                 by=c("probe", "chrom", "strand", "newCol")]
Data set:
ID ID2 probe chrom strand txStart txEnd newCol
Rest_3 uc001aah.4 8044649 chr1 0 14361 29370 1.02
Rest_4 uc001aah.4 7911309 chr1 0 14361 29370 1.30
Rest_5 uc001aah.4 8171066 chr1 0 14361 29370 2.80
Rest_6 uc001aah.4 8159790 chr1 0 14361 29370 4.12
Rest_17 uc001abw.1 7896761 chr1 0 861120 879961 1.11
Rest_18 uc001abx.1 7896761 chr1 0 871151 879961 3.12
I added this for loop to average the collapsed newCol values that end up in a single cell (from the first dt.out). However, the loop takes ages to run. Is there a quicker way of doing this?
for(i in 1:NROW(dt.out)){
  con <- textConnection(dt.out[i, grep("newCol", colnames(dt.out))])
  data <- read.csv(con, sep=",", header=FALSE)
  close(con)
  dt.out[i, grep("newCol", colnames(dt.out))] <- as.numeric(rowMeans(data))
}
Upvotes: 1
Views: 231
Reputation: 118779
newCol seems to be an additional column compared to the data in the other question. I guess that after obtaining the first dt.out, you'd want to take the mean of the collapsed values of newCol?
You can do that by replacing newCol directly with sapply(strsplit(.)). Basically, after obtaining the first dt.out, do this:
dt.out[ , newCol := sapply(strsplit(newCol, ","), function(x) mean(as.numeric(x)))]
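To see the one-liner in context, here is a minimal sketch that rebuilds the six rows from the question by hand, runs the first collapsing step, and then averages the collapsed newCol strings. The data.table construction is my own transcription of the posted data set, and the column types (integer for strand, txStart and txEnd) are assumptions.

```r
library(data.table)

# Rows transcribed from the data set in the question; column types are assumed.
dt <- data.table(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2     = c(rep("uc001aah.4", 4), "uc001abw.1", "uc001abx.1"),
  probe   = c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L),
  chrom   = "chr1",
  strand  = 0L,
  txStart = c(rep(14361L, 4), 861120L, 871151L),
  txEnd   = c(rep(29370L, 4), 879961L, 879961L),
  newCol  = c(1.02, 1.30, 2.80, 4.12, 1.11, 3.12)
)

# First step from the question: collapse every non-grouping column into a
# comma-separated string, one row per group.
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse = ",")),
             by = c("ID2", "chrom", "strand", "txStart", "txEnd")]

# Replacement for the loop: split each collapsed string on commas, convert the
# pieces to numeric, and take their mean. The column is updated by reference.
dt.out[, newCol := sapply(strsplit(newCol, ","), function(x) mean(as.numeric(x)))]
```

With these rows, dt.out has three groups and newCol holds the per-group means (2.31 for uc001aah.4, and the single values 1.11 and 3.12 for the other two). Because strsplit and sapply work on the whole column at once, no per-row textConnection or read.csv call is needed, which is where the loop was most likely spending its time.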
Upvotes: 2