Jared
Jared

Reputation: 95

How can I improve my looping Python script, involving different mathematical operations for different conditions on each loop?

I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.

I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.

A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

I then have another file with a table containing weights and frequencies for the three SNPs, such as below:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136  

The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.

The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.

import sys

snp={}
riskall={}
weights={}
freqs={}    # effect allele, *MAY NOT BE MINOR ALLELE

pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)

# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        snp[pos]=(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        freqs[line.split()[0]]=line.split()[pop]

        pos+=1



### compute scores for each combination
with open(sys.argv[2], 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            rsid=snp[j]
            riskallele=riskall[rsid]
            frequency=freqs[rsid]
            wei=weights[rsid]
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskallele:      # homozygous for ref
                score+=0
                freq*=(1-float(frequency))*(1-float(frequency))
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!
                score+=float(wei)
                freq*=2*(1-float(frequency))*(float(frequency))
            elif allele1 == riskallele: # and allele2 == riskall[snp[j]]:      # homozygous for risk, be sure to limit risk to second allele!
                score+=2*float(wei)
                freq*=float(frequency)*float(frequency)

            if freq < float(sys.argv[3]):   # threshold to stop loop in interest of efficiency 
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))

I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?

I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.

The expected output from the above would be as such:

GG,AA,GG        0       0.000286302968304
GG,AA,GA        0.0769610411361 7.33845153414e-05
GG,AA,AA        0.153922082272  4.70243735491e-06
GG,AG,GG        0.0676586484738 0.00024540254426
GG,AG,GA        0.14461968961   6.29010131498e-05
GG,AG,AA        0.221580730746  4.03066058992e-06
GG,GG,GG        0.135317296948  5.25862594844e-05
GG,GG,GA        0.212278338084  1.34787885321e-05
GG,GG,AA        0.28923937922   8.63712983555e-07
GA,AA,GG        0.223143551314  0.0204250448374
GA,AA,GA        0.30010459245   0.00523530030129
GA,AA,AA        0.377065633586  0.000335475019306
GA,AG,GG        0.290802199788  0.0175071812892
GA,AG,GA        0.367763240924  0.00448740025824
GA,AG,AA        0.44472428206   0.000287550016548
GA,GG,GG        0.358460848262  0.00375153884769
GA,GG,GA        0.435421889398  0.000961585769624
GA,GG,AA        0.512382930534  6.16178606889e-05
AA,AA,GG        0.446287102628  0.364284082594
AA,AA,GA        0.523248143764  0.0933724543834
AA,AA,AA        0.6002091849    0.00598325294334
AA,AG,GG        0.513945751102  0.312243499367
AA,AG,GA        0.590906792238  0.0800335323286
AA,AG,AA        0.667867833374  0.00512850252286
AA,GG,GG        0.581604399576  0.0669093212928
AA,GG,GA        0.658565440712  0.0171500426418
AA,GG,AA        0.735526481848  0.00109896482633

EDIT: Added previous post link, along with expected output.

Upvotes: 0

Views: 96

Answers (1)

petrch
petrch

Reputation: 1968

Disclaimer: I did not test this, it is rather a pseudo-code.

I provide some general ideas about what is slow/fast in programming and particularly in python:

You should try to move out of loops everything what is not changing in that loop. Also, in python, you should try to replace loops with comprehensions https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

[ expression for item in list if conditional ]

you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient

    rsid=snp[j]
    riskallele=riskall[rsid]

is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop):

snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])} 
         for line in map(split,f)]

and your computing loop can become something like this:

### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
    for fline in f:
        score=0.0 # work with floats from the start
        freq=1.0
        line = fline.split() # do it only once

        for j,field in line:
            s=snp[j]
            riskallele=s["riskall"]
            frequency=s["freq"]
            wei=s["weight"]
            (allele1,allele2) = line[j]

            if allele2 != riskallele:      # homozygous for ref
                score+=0
                freq*=(1-frequency)*(1-frequency)
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!
                score+=wei
                freq*=2*(1-frequency)*frequency
            elif allele1 == riskallele: # and allele2 == riskallele:      # homozygous for risk, be sure to limit risk to second allele!
                score+=2*wei
                freq*=frequency*frequency

            if freq < stop):   # threshold to stop loop in interest of efficiency 
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))

The ultimate goal I would try to achieve is to convert it to some map/reduce form:

the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form [AC,C] -> score,freq_function and I can get rid of the whole if block.

Sometimes the best approach is to split the code to small functions, reorganize and then merge back.

Upvotes: 1

Related Questions