MattS

Reputation: 45

Python: Speeding up script with inputs of > billion rows

I've seen a lot of tips on how to speed up Python code or make it more efficient. I've tried some of them on the code below: changing global variables to local variables where possible, using .format to build strings instead of concatenating them, and avoiding creating multiple intermediate variables. Still, this script takes 1h25m to run. I have two input files:

1) A bed file: a two-column (tab-delimited) file with a number or code in the first column and a number in the second column. It has ~2 billion lines, and each combination of the two columns is unique (it covers every position in a genome; the first column is the chromosome, the second is the position):

1 1
1 2
1 3
1 4
...

2) A complex file, where the first ~3000 lines are a header starting with #, followed by entries that again have a number/code + number combination in the first two columns. These two columns link the two files (1 1 in file 1 is the same position as 1 1 in file 2). This file has ~22 million rows. Here are the first three data lines as an example:

1   1   .   G   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    0/1:0:36
1   2   .   T   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    ./.:0:36
1   3   .   C   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    1/1:0:36

Question: I want to filter out rows in the first file if the matching rows in the second file have 0/0, 0/1 or 1/1 (the fourth possibility is ./.) at the start of the last column (so I need to parse the last column to reach those three characters).
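Per row, the test boils down to something like this (a minimal sketch; the column layout is as in the example rows above):

fields = line.strip().split()        # whitespace-separated columns of one file-2 row
genotype = fields[-1].split(':')[0]  # last column looks like 0/1:0:36 -> take the 0/1 part
is_called = genotype != './.'        # 0/0, 0/1 and 1/1 all count as called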

The added complexity is that file #2 has to be read through a pipe from another program, because it is compressed in a specific format produced by that program (opening this file takes a long time on its own, but there is nothing I can do about that...).

Call: program view file2.vcf.gz | my_script.py file1.bed

import sys
import re
import time

start_time = time.time()


def make_vcf_dict(vcf):
    mydict = {}
    for line in (line for line in vcf if not line.startswith("#")):
        line = line.strip().split()
        genotype = line[-1].split(':')[0]

        motif = re.compile(r'\./\.')
        if motif.match(genotype) is None:
            mydict.setdefault('{}:{}'.format(line[0], line[1]),
                              '{}:{}'.format(line[0], line[1]))

    return mydict

def create_output_bed(bed,data):

    print "creating output"
    for line in (line for line in data if line.startswith('#CHROM')):
        output_name='{}_mask_positions.bed'.format(line.strip().split()[-1])
    print output_name
    output=open(output_name,'w') 

    print "making dictionary"   
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in the dict:
        region = '{}:{}'.format(line[0], line[1])
        if region not in mydict:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
data=sys.stdin.readlines()  #.readlines here is key!!

mydict=make_vcf_dict(data)

#read the bed file:
print "Writing output"
create_output_bed(open(sys.argv[1],'r'),data)

print("--- %s seconds ---" % (time.time() - start_time))

I was wondering if there is a more efficient way to deal with this entirely? Not making a dictionary at all, or splitting my file? I have a 32-core server to work with and little scripting experience... Thank you!

Upvotes: 1

Views: 229

Answers (1)

Liran Funaro

Reputation: 2878

If the second file has only a few million rows (not billions like the first), then I expect the data to fit in memory: ~22 million (chromosome, position) string pairs should come to a few gigabytes at most.

I have a 32 core server to deal with this

Parallelizing it won't help you much, because the main bottleneck is the disk, not the CPU, unless the data were distributed among many files on different disks.

However, there are some improvements you can make:

  • Move the regex compilation outside the loop (motif = re.compile(r'\./\.')).
  • Use a set instead of a dict.
  • Avoid format; use a tuple as the key instead (illustrated right after this list).
  • Don't read all the lines into memory beforehand.
  • Avoid going over stdin twice.
  • Avoid doing anything twice.
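For example, membership tests on tuples skip building a throwaway string for every line (a minimal illustration with made-up values):

positions = set()
positions.add(('1', '12345'))       # tuple key: no string formatting needed
print(('1', '12345') in positions)  # True; tuples hash and compare by element

With all of the above applied, the script becomes: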
import sys
import re
import time

start_time = time.time()

def make_vcf(vcf_input):
    output = set()
    motif = re.compile(r'\./\.')

    for line in vcf_input:
        line = line.strip().split()
        if line[0].startswith('#CHROM'):
            output_name = '{}_mask_positions.bed'.format(line[-1])
            continue
        elif line[0].startswith("#"):
            continue

        genotype = line[-1].split(':')[0]

        if motif.match(genotype) is None:
            output.add((line[0], line[1]))

    return output_name, output


def create_output_bed(output_name, vcf, bed):
    print "creating output:", output_name
    output = open(output_name, 'w')

    print "filtering bed file"
    for line in bed:
        line = line.strip().split()
        # build the same (chrom, pos) key used in the set:
        region = line[0], line[1]
        if region not in vcf:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"  
output_name, vcf = make_vcf(sys.stdin.readlines())

#read the bed file:
print "Writing output"
create_output_bed(output_name, vcf, open(sys.argv[1],'r'))

print("--- %s seconds ---" % (time.time() - start_time))

Upvotes: 1
