How to speed up the loop running process in R for huge files

Question

I have a genetics sequencing file - 4 millon rows. I'm trying to run a piece of code for variants each unique gene listed.

Here is an exmple of how the data is

CHROM   POS GENE    IMPACT    HOM
1   23455   A   HIGH      HET
1   23675   A   HIGH      HET
1   23895   A   MODERATE    
1   24115   B   LOW   HET
1   24335   B   HIGH      HET
1   24555   B   LOW   HET
2   6789    C   LOW 
2   12346   C   LOW   HET
2   17903   C   MODERATE  HET
2   23460   C   MODERATE    
2   29017   D   LOW   HET
2   34574   D   HIGH    
2   40131   D   HIGH      HET
3   567890  E   HIGH      HET
3   589076  E   HIGH    
3   610262  E   LOW   HET
3   631448  F   HIGH      HET
3   652634  F   MODERATE  HET

And here is my code:

sam <- read.csv("../sample/sample1.txt", sep="	",header=TRUE,stringsAsFactors=FALSE)

glist <- unique(sam[,3])

for(i in glist) {

lice <- subset(sam, GENE == i)
lice$mut <- as.numeric(ifelse((lice[c(4)] == 'MODERATE' | lice[c(4)] == 'HIGH'), c(1), c(0))) 
lice$cntmut <- sum(lice$mut, na.rm=TRUE)
lice$het <- as.numeric(ifelse(lice$HOM == 'HET', c(1), c(0))) 
lice$cnthet <- sum(lice$het, na.rm=TRUE)
lice$cnthetmut <- lice$mut + lice$het
lice$lice <- ifelse(lice$mut == 1 & lice$cntmut >= 2 & lice$het == 1 & lice$cnthet >= 2 &     lice$cnthetmut == 2 , 'lice', '')

write.table(lice,paste0("../sample/list/",i,".txt"),sep="	",quote=F,row.names=F)

}

licelist <- list.files("../sample/list/", full.names=T)   
lice2 <- do.call("rbind",lapply(licelist, FUN=function(files){read.table(files, header=TRUE, sep="	", stringsAsFactors=FALSE)}))

lice_out <- merge(sam,lice2,by.x=c("CHROM","POS"),by.y=c("CHROM","POS"), all=T)

write.table(lice_out,"../sample/sample1_lice.txt",sep="	",quote=F,row.names=F)

I have 30,000 genes which means running this code will take about 2 weeks (the original file is about 4GB in size). I was wondering whether anyone had any advice on how to speed this up? I've tried writing a function to include all this info (which some is repetitive) but to no avail.

Just to add:

The code in the loop is essentially doing the following: 1. adding up how many variants in each gene are moderate or high and how many are are het. 2. lice is given to a variant in a gene if the variant is moderate/high, is a het, and only if there are more than two of these types variants in the gene

For this result:

CHROM   POS GENE    IMPACT  HOM LICE
1   23455   A   HIGH    HET lice
1   23675   A   HIGH    HET lice
1   23895   A   MODERATE        
1   24115   B   LOW HET 
1   24335   B   HIGH    HET 
1   24555   B   LOW HET 
2   6789    C   LOW     
2   12346   C   LOW HET 
2   17903   C   MODERATE    HET 
2   23460   C   MODERATE        
2   29017   D   LOW HET 
2   34574   D   HIGH        
2   40131   D   HIGH    HET 
3   567890  E   HIGH    HET 
3   589076  E   HIGH        
3   610262  E   LOW HET 
3   631448  F   HIGH    HET lice
3   652634  F   MODERATE    HET lice

Like I mentioned a bit further up, the steps are not all necessary but worked at the time when I was doing it on a smaller datat frame.

jlhoward · Accepted Answer

Ir's a bit difficult to help you when you don't explain what you are trying to accomplish, or provide an example of what the desired result looks like with your sample dataset, but here are a few suggestions:

(1) Use data tables. They a much faster and use memory much more efficiently.

(2) Other than the sums (cntmut, cnthet) I don't see why you spit the original table. There are other ways to get the sums without splitting the dataset.

(3) I don't really see the point of the merge at the end.

Here's an option that will likely be much faster.

library(data.table)
dt <- data.table(sam)
setkey(dt,GENE)
dt[,mut:=as.numeric(IMPACT=="MODERATE"|IMPACT=="HIGH")]
dt[,cntmut:=sum(mut), by=GENE]
dt[,het:=as.numeric(HOM=="HET")]
dt[,cnthet:=sum(het),by=GENE]
dt[,cnthetmut:=mut+het]
dt[,lice:=ifelse(mut==1 & cntmut>=2 & het==1 & cnthetmut ==2,'lice',''), by=GENE]
head(dt)
#    CHROM   POS GENE   IMPACT HOM mut cntmut het cnthet cnthetmut lice
# 1:     1 23455    A     HIGH HET   1      3   1      2         2 lice
# 2:     1 23675    A     HIGH HET   1      3   1      2         2 lice
# 3:     1 23895    A MODERATE       1      3   0      2         1     
# 4:     1 24115    B      LOW HET   0      1   1      3         1     
# 5:     1 24335    B     HIGH HET   1      1   1      3         2     
# 6:     1 24555    B      LOW HET   0      1   1      3         1

How to speed up the loop running process in R for huge files

Answers (1)

Related Questions