MattS

Reputation: 45

Python: Speeding up script with inputs of > billion rows

I've seen a lot of tips on how to speed up Python code or make it more efficient. I've tried some of them on the code below: changing global variables to local variables where possible, using .format to build strings instead of concatenating them, and avoiding creating multiple intermediate variables. Still, this script takes 1h25m to run. I have two input files:

1) A bed file: a two-column (tab-delimited) file with a number or code in the first column and a number in the second column. It has ~2 billion lines, and each combination of the two columns is unique (it covers every position in a genome; the first column is the chromosome, the second is the position):

1 1
1 2
1 3
1 4
...

2) A complex file, where the first ~3000 lines are a header starting with #, followed by entries that again have a number/code + number combination in the first two columns. These two columns link the two files (1 1 in file 1 is the same position as 1 1 in file 2). This file has ~22 million rows. Here are the first three data lines as an example:

1   1   .   G   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    0/1:0:36
1   2   .   T   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    ./.:0:36
1   3   .   C   .   32.9939 .   DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923   GT:PL:GQ    1/1:0:36

Question: I want to filter out rows in the first file if the matching rows in the second file have 0/0, 0/1 or 1/1 (the fourth possibility is ./.) at the start of the last column (so I need to parse the last column to reach those three characters).
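Per row, the test boils down to something like this (a minimal sketch; the column layout is as in the example rows above):

fields = line.strip().split()        # whitespace-separated columns of one file-2 row
genotype = fields[-1].split(':')[0]  # last column looks like 0/1:0:36 -> take the 0/1 part
is_called = genotype != './.'        # 0/0, 0/1 and 1/1 all count as called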

The added complexity is that file #2 has to be read through a pipe from another program, because it is compressed in a specific format produced by that program (opening this file takes a long time on its own, but there is nothing I can do about that...).

Call: program view file2.vcf.gz | my_script.py file1.bed

import sys
import re
import time

start_time = time.time()


def make_vcf_dict(vcf):
    mydict = {}
    for line in (line for line in vcf if not line.startswith("#")):
        line = line.strip().split()
        genotype = line[-1].split(':')[0]

        motif = re.compile(r'\./\.')
        if motif.match(genotype) is None:
            mydict.setdefault('{}:{}'.format(line[0], line[1]),
                              '{}:{}'.format(line[0], line[1]))

    return mydict

def create_output_bed(bed,data):

    print "creating output"
    for line in (line for line in data if line.startswith('#CHROM')):
        output_name='{}_mask_positions.bed'.format(line.strip().split()[-1])
    print output_name
    output=open(output_name,'w') 

    print "making dictionary"   
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in the dict:
        region = '{}:{}'.format(line[0], line[1])
        if region not in mydict:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
data=sys.stdin.readlines()  #.readlines here is key!!

mydict=make_vcf_dict(data)

#read the bed file:
print "Writing output"
create_output_bed(open(sys.argv[1],'r'),data)

print("--- %s seconds ---" % (time.time() - start_time))

I was wondering if there is a more efficient way to deal with this entirely? Not making a dictionary at all, or splitting my file? I have a 32-core server to work with and little scripting experience... Thank you!

Upvotes: 1

Views: 229

Answers (1)

Liran Funaro

Reputation: 2878

If the second file has only a few million rows (not billions like the first), then I expect the data to fit in memory: ~22 million (chromosome, position) string pairs should come to a few gigabytes at most.

I have a 32 core server to deal with this

Parallelizing it won't help you much, because the main bottleneck is the disk, not the CPU, unless the data were distributed among many files on different disks.

However, there are some improvements you can make:

  • Move the regex compilation outside the loop (motif = re.compile(r'\./\.')).
  • Use a set instead of a dict.
  • Avoid format; use a tuple as the key instead (illustrated right after this list).
  • Don't read all the lines into memory beforehand.
  • Avoid going over stdin twice.
  • Avoid doing anything twice.
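For example, membership tests on tuples skip building a throwaway string for every line (a minimal illustration with made-up values):

positions = set()
positions.add(('1', '12345'))       # tuple key: no string formatting needed
print(('1', '12345') in positions)  # True; tuples hash and compare by element

With all of the above applied, the script becomes: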
import sys
import re
import time

start_time = time.time()

def make_vcf(vcf_input):
    output = set()
    motif = re.compile(r'\./\.')

    for line in vcf_input:
        line = line.strip().split()
        if line[0].startswith('#CHROM'):
            output_name = '{}_mask_positions.bed'.format(line[-1])
            continue
        elif line[0].startswith("#"):
            continue

        genotype = line[-1].split(':')[0]

        if motif.match(genotype) is None:
            output.add((line[0], line[1]))

    return output_name, output


def create_output_bed(output_name, vcf, bed):
    print "creating output:", output_name
    output = open(output_name, 'w')

    print "filtering bed file"
    for line in bed:
        line = line.strip().split()
        # build the same (chrom, pos) key used in the set:
        region = line[0], line[1]
        if region not in vcf:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"  
output_name, vcf = make_vcf(sys.stdin.readlines())

#read the bed file:
print "Writing output"
create_output_bed(output_name, vcf, open(sys.argv[1],'r'))

print("--- %s seconds ---" % (time.time() - start_time))

Upvotes: 1
