Reputation: 627
I have a genetics sequencing file - 4 millon rows. I'm trying to run a piece of code for variants each unique gene listed.
Here is an exmple of how the data is
CHROM POS GENE IMPACT HOM
1 23455 A HIGH HET
1 23675 A HIGH HET
1 23895 A MODERATE
1 24115 B LOW HET
1 24335 B HIGH HET
1 24555 B LOW HET
2 6789 C LOW
2 12346 C LOW HET
2 17903 C MODERATE HET
2 23460 C MODERATE
2 29017 D LOW HET
2 34574 D HIGH
2 40131 D HIGH HET
3 567890 E HIGH HET
3 589076 E HIGH
3 610262 E LOW HET
3 631448 F HIGH HET
3 652634 F MODERATE HET
And here is my code:
sam <- read.csv("../sample/sample1.txt", sep="\t",header=TRUE,stringsAsFactors=FALSE)
glist <- unique(sam[,3])
for(i in glist) {
lice <- subset(sam, GENE == i)
lice$mut <- as.numeric(ifelse((lice[c(4)] == 'MODERATE' | lice[c(4)] == 'HIGH'), c(1), c(0)))
lice$cntmut <- sum(lice$mut, na.rm=TRUE)
lice$het <- as.numeric(ifelse(lice$HOM == 'HET', c(1), c(0)))
lice$cnthet <- sum(lice$het, na.rm=TRUE)
lice$cnthetmut <- lice$mut + lice$het
lice$lice <- ifelse(lice$mut == 1 & lice$cntmut >= 2 & lice$het == 1 & lice$cnthet >= 2 & lice$cnthetmut == 2 , 'lice', '')
write.table(lice,paste0("../sample/list/",i,".txt"),sep="\t",quote=F,row.names=F)
}
licelist <- list.files("../sample/list/", full.names=T)
lice2 <- do.call("rbind",lapply(licelist, FUN=function(files){read.table(files, header=TRUE, sep="\t", stringsAsFactors=FALSE)}))
lice_out <- merge(sam,lice2,by.x=c("CHROM","POS"),by.y=c("CHROM","POS"), all=T)
write.table(lice_out,"../sample/sample1_lice.txt",sep="\t",quote=F,row.names=F)
I have 30,000 genes which means running this code will take about 2 weeks (the original file is about 4GB in size). I was wondering whether anyone had any advice on how to speed this up? I've tried writing a function to include all this info (which some is repetitive) but to no avail.
Just to add:
The code in the loop is essentially doing the following:
1. adding up how many variants in each gene are moderate
or high
and how many are are het
.
2. lice
is given to a variant in a gene if the variant is moderate/high
, is a het
, and only if there are more than two of these types variants in the gene
For this result:
CHROM POS GENE IMPACT HOM LICE
1 23455 A HIGH HET lice
1 23675 A HIGH HET lice
1 23895 A MODERATE
1 24115 B LOW HET
1 24335 B HIGH HET
1 24555 B LOW HET
2 6789 C LOW
2 12346 C LOW HET
2 17903 C MODERATE HET
2 23460 C MODERATE
2 29017 D LOW HET
2 34574 D HIGH
2 40131 D HIGH HET
3 567890 E HIGH HET
3 589076 E HIGH
3 610262 E LOW HET
3 631448 F HIGH HET lice
3 652634 F MODERATE HET lice
Like I mentioned a bit further up, the steps are not all necessary but worked at the time when I was doing it on a smaller datat frame.
Upvotes: 1
Views: 109
Reputation: 59415
Ir's a bit difficult to help you when you don't explain what you are trying to accomplish, or provide an example of what the desired result looks like with your sample dataset, but here are a few suggestions:
(1) Use data tables. They a much faster and use memory much more efficiently.
(2) Other than the sums (cntmut, cnthet) I don't see why you spit the original table. There are other ways to get the sums without splitting the dataset.
(3) I don't really see the point of the merge at the end.
Here's an option that will likely be much faster.
library(data.table)
dt <- data.table(sam)
setkey(dt,GENE)
dt[,mut:=as.numeric(IMPACT=="MODERATE"|IMPACT=="HIGH")]
dt[,cntmut:=sum(mut), by=GENE]
dt[,het:=as.numeric(HOM=="HET")]
dt[,cnthet:=sum(het),by=GENE]
dt[,cnthetmut:=mut+het]
dt[,lice:=ifelse(mut==1 & cntmut>=2 & het==1 & cnthetmut ==2,'lice',''), by=GENE]
head(dt)
# CHROM POS GENE IMPACT HOM mut cntmut het cnthet cnthetmut lice
# 1: 1 23455 A HIGH HET 1 3 1 2 2 lice
# 2: 1 23675 A HIGH HET 1 3 1 2 2 lice
# 3: 1 23895 A MODERATE 1 3 0 2 1
# 4: 1 24115 B LOW HET 0 1 1 3 1
# 5: 1 24335 B HIGH HET 1 1 1 3 2
# 6: 1 24555 B LOW HET 0 1 1 3 1
Upvotes: 3