Reputation: 45
I've seen a lot of tips going around on how to speed up Python code or make it more efficient. I've tried some of them on the code below: changing global variables to local variables whenever possible, using .format to build strings instead of concatenating them, and avoiding creating unnecessary variables. Still, this script takes about 1h25 to run. I have two input files:
1) A bed file: a two-column (tab-delimited) file with a number or code in the first column and a number in the second column. It has ~2 billion lines, and each combination of the two columns is unique (it covers every position in a genome; the first column is the chromosome, the second is the position):
1 1
1 2
1 3
1 4
...
2) A more complex file, where the first ~3000 lines are a header starting with #, followed by entries keyed, again, by a number/code + a number in the first two columns. These two columns link it to the first file (1 1 in file 1 is the same as 1 1 in file 2). This file has ~22 million rows. Here is an example of the first three lines:
1 1 . G . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ 0/1:0:36
1 2 . T . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ ./.:0:36
1 3 . C . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ 1/1:0:36
Question: I want to filter out rows from the first file if the corresponding row in the second file has 0/0, 0/1 or 1/1 (the fourth possibility is ./.) in the last column, so I need to parse the last column to reach those three characters.
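For illustration, here is a minimal sketch of that parsing step on one of the example lines above (Python 2, to match the script below; the variable names are just illustrative, not part of the script):

line = "1\t2\t.\tT\t.\t32.9939\t.\tDP=1;MQ0F=0\tGT:PL:GQ\t./.:0:36"
genotype = line.strip().split()[-1].split(':')[0]  # '0/0', '0/1', '1/1' or './.'
keep_position = (genotype == './.')  # only './.' positions should stay in the output bed
print genotype, keep_position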
The added complexity is that file #2 has to be read through a pipe from another program, because it is compressed in a specific way by that program (opening this file takes a long time on its own, but there is nothing I can do about that...).
Call: program view file2.vcf.gz | my_script.py file1.bed
import sys
import re
import time

start_time = time.time()

def make_vcf_dict(vcf):
    mydict = {}
    for line in (line for line in vcf if not line.startswith("#")):
        line = line.strip().split()
        genotype = line[-1].split(':')[0]
        motif = re.compile('\./\.')
        if motif.match(genotype) is None:
            mydict.setdefault('{}:{}'.format(line[0], line[1]), '{}:{}'.format(line[0], line[1]))
    return mydict

def create_output_bed(bed, data):
    print "creating output"
    for line in (line for line in data if line.startswith('#CHROM')):
        output_name = '{}_mask_positions.bed'.format(line.strip().split()[-1])
        print output_name
    output = open(output_name, 'w')
    print "making dictionary"
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in dict:
        region = '{}:{}'.format(line[0], line[1])
        if region not in mydict:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
data = sys.stdin.readlines()  # .readlines here is key!!
mydict = make_vcf_dict(data)
# read the bed file:
print "Writing output"
create_output_bed(open(sys.argv[1], 'r'), data)
print("--- %s seconds ---" % (time.time() - start_time))
I was wondering if there is a more efficient way to deal with this entirely? Not building a dictionary, or splitting my file? I have a 32-core server to work with and little experience with scripting... Thank you!
Upvotes: 1
Views: 229
Reputation: 2878
If the second file has only a few million rows (not billions, like the first), then I expect the data to fit in memory.
I have a 32 core server to deal with this
Parallelizing it won't help you much, because the main bottleneck is the disk, not the CPU, unless the data were distributed among many files on different disks.
However, you do have some improvements you can make:
- Move motif=re.compile('\./\.') out of the loop (compile the pattern once, not once per line).
- Use a set instead of a dict.
- Don't build the key with .format, just use a tuple (a rough timing sketch after the code below illustrates the difference).
import sys
import re
import time

start_time = time.time()

def make_vcf(vcf_input):
    output = set()
    motif = re.compile('\./\.')
    for line in vcf_input:
        line = line.strip().split()
        if line[0].startswith('#CHROM'):
            output_name = '{}_mask_positions.bed'.format(line[-1])
            continue
        elif line[0].startswith("#"):
            continue
        genotype = line[-1].split(':')[0]
        if motif.match(genotype) is None:
            output.add((line[0], line[1]))
    return output_name, output

def create_output_bed(output_name, vcf, bed):
    print "creating output:", output_name
    output = open(output_name, 'w')
    print "making dictionary"
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in dict:
        region = line[0], line[1]
        if region not in vcf:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
output_name, vcf = make_vcf(sys.stdin.readlines())
# read the bed file:
print "Writing output"
create_output_bed(output_name, vcf, open(sys.argv[1], 'r'))
print("--- %s seconds ---" % (time.time() - start_time))
Upvotes: 1