Reputation: 107

Methods of improving the efficiency of a a python-loop computation

I would like to speed up the following code related to spherical modes. It is a simplification of my actual code (I didn't want to oversimplify it because it can lead to solutions that are not valid for my actual problem):

import numpy as np
import time
import math

def function_call(npp,nmax):
    matrix_a = np.random.rand(npp)
    matrix_b = np.random.rand(npp)
    a=np.random.rand()
    F = np.zeros((2*npp, 2*nmax*(nmax+2)),dtype=np.complex_) 
    npa=np.arange(npp)
    for n in range(1,nmax+1,1):
        a_n = np.sqrt(1 / (2 * np.pi * n * (n + 1)))
        for m in range(-n,n+1,1):        
            b_m = (-1)**((np.abs(m) + m) / 2)
            p_mn = int(1 / 2 * (np.abs(m) + n + 1 / 2 * (1 - (-1)**(np.abs(m) + n))))
            alpha_mn = np.sqrt(((2 * n + 1) / 2) * math.factorial(n - np.abs(m)) / math.factorial(n + np.abs(m)))
            A_mn = np.zeros(npp)
            B_mn = np.zeros(npp)
            for p in range(p_mn,n+1,1):
                Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p)/math.factorial(2 * p - np.abs(m) - n)
                A_mn = A_mn + Cai_pmn * (np.cos(matrix_a))**(2 * p - np.abs(m) - n)
                B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (np.cos(matrix_a))**(np.abs(2 * p - np.abs(m) - n - 1))
            A_mn = A_mn / (2**n * math.factorial(n))
            B_mn = B_mn / (2**n * math.factorial(n))
            S_mn = alpha_mn * m * A_mn * np.sin(matrix_a)**np.abs(np.abs(m) - 1)
            D_mn = alpha_mn * (np.abs(m) * A_mn * np.cos(matrix_a) * (np.sin(matrix_a))**(np.abs(np.abs(m) - 1)) - B_mn * (np.sin(matrix_a))**(np.abs(m) + 1))
            h1 = 1j**(n+1)*np.exp(-1j*a)/(a)
            h2 = 1j**(n)*np.exp(-1j*a)/(a)
            F_s1_theta = 1j * a_n * b_m * h1 * (S_mn * np.exp(1j * m * matrix_b))
            F_s1_phi =       -a_n * b_m * h1 * (D_mn * np.exp(1j * m * matrix_b))
            F_s2_theta =      a_n * b_m * h2 * (D_mn * np.exp(1j * m * matrix_b))
            F_s2_phi =   1j * a_n * b_m * h2 * (S_mn * np.exp(1j * m * matrix_b))
            j = 2 * (n * (n + 1) + m - 1)
            F[2 * npa, j] = F_s1_theta[npa]
            F[2 * npa+1  , j] = F_s1_phi[npa]
            j = 2 * (n * (n + 1) + m - 1) + 1
            F[2 * npa, j] = F_s2_theta[npa]
            F[2 * npa+1, j] = F_s2_phi[npa]               
prev_time_ep =time.time()
npp=500
nmax=80 
function_call(npp,nmax)           
print("       --- %s seconds ---" % (time.time() - prev_time_ep))

The first option that I have tried is to vectorize it (it took me some time because it is not obvious). However, memory consumption grows rapidly and it is not efficient.

I have also tried with Numba and I actually succeed to reduce by 4 the previous time, but I was looking for larger improvements if it is possible.

I have read also that maybe multiprocessing or Cython could be good options. Maybe there is a way of vectorizing it without growing rapidly in memory usage?

Upvotes: 1

Answers (3)

Glauco

Reputation: 1463

first of all your code has O(n^3) complexity that is not a big deal.

A lot of this code can be done using array programming that will speed up a lot expecially using numpy.

I suggest to use a profiler to find not performing code and start to rewrite the code in vectrial out of loops.

I wrote a tool perf_tools useful for this kind of works. It can guide you in a sort of performance driven solution.

Upvotes: 0

Glauco

Reputation: 1463

I've worked a little bit on your code, here the benchmark. the bottleneck is in the factorial computation.

    ================== PerfTool ==================
task                     |aver(s) |sum(s)  |count   |std     
main loop                |   0.134|  10.712|      80|   0.101
  +-second loop          |   0.134|  10.712|      80|   0.101
    +-A                  |   0.000|   0.245|    6560|   0.000
    +-B                  |   0.001|   5.648|    6560|   0.001
    +-C                  |   0.000|   0.541|    6560|   0.000
    +-D                  |   0.000|   1.505|    6560|   0.000
    +-E                  |   0.000|   1.769|    6560|   0.000
    +-F                  |   0.000|   0.867|    6560|   0.000
mx creation              |   0.000|   0.000|       1|   0.000
preparation              |   0.000|   0.000|       1|   0.000

overall                  |    0.03|   10.71|   39522|-

The B and C sentinel are:

      with PerfTool('B'):
                for p in range(p_mn,n+1,1):
                    Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p)/math.factorial(2 * p - np.abs(m) - n)
                    A_mn = A_mn + Cai_pmn * (np.cos(matrix_a))**(2 * p - np.abs(m) - n)
                    B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (np.cos(matrix_a))**(np.abs(2 * p - np.abs(m) - n - 1))

      with PerfTool('C'):
                A_mn = A_mn / (2**n * math.factorial(n))
                B_mn = B_mn / (2**n * math.factorial(n))

As you can see most of the time is spent on B, so i've add a kind of cache, here:

   rng = np.arange(1,nmax+1,1)
    cache = dict(zip(rng,factorial(rng)))
    def get_factorial(w,cache=cache):
        if w not in cache:
            cache[w] = math.factorial(w)
        return cache[w]

To use instead of math.factorial, this avoid to recalculate same values.

Finally, B was refactored as B_vec, and here there is the root of the evil! i've marked that code as B_vec_slow, that 2 lines takes up most of the time.

            with PerfTool('B_vec'):
                prng = np.arange(p_mn, n+1)
                Cai_pmn_vec = get_factorial(n) * ((-1)**(n + prng)) / (factorial(prng) * factorial(n - prng)) * factorial(2 * prng)/factorial(2 * prng - np.abs(m) - n)
                with PerfTool('B_vec_slow'):
                    A_mn_vec = Cai_pmn_vec*np.power(cos_matrix_a[:,np.newaxis],2 * prng - np.abs(m) - n)
                    B_mn_vec = (2 * prng - np.abs(m) - n) * Cai_pmn_vec * np.power(cos_matrix_a[:,np.newaxis], np.abs(2 * prng - np.abs(m) - n - 1))
                A_mn = np.sum(A_mn_vec,axis=1)
                B_mn = np.sum(B_mn_vec,axis=1)

This is the result:

================== PerfTool ==================
task                     |aver(s) |sum(s)  |count   |std     
main loop                |   0.072|   5.736|      80|   0.052
  +-second loop          |   0.072|   5.735|      80|   0.052
    +-A                  |   0.000|   0.194|    6560|   0.000
    +-B_vec              |   0.001|   3.490|    6560|   0.000
      +-B_vec_slow       |   0.000|   2.987|    6560|   0.000
    +-C                  |   0.000|   0.126|    6560|   0.000
    +-D                  |   0.000|   0.536|    6560|   0.000
    +-E                  |   0.000|   0.768|    6560|   0.000
    +-F                  |   0.000|   0.522|    6560|   0.000
preparation              |   0.000|   0.000|       1|   0.000
mx creation              |   0.000|   0.000|       1|   0.000

overall                  |    0.01|    5.74|   46082|-

If you can work on these 2 lines you can expect to run in 2/3 sec.

HERE: the optimized code: https://www.codepile.net/pile/8oDyGp6Q

Upvotes: 1

Miguel Herrera Ruiz

Reputation: 134

I'm not sure about which loops are faster than others in Python, but as I can see in your code, there are some duplicated function calls (same arguments).

For instance:

A_mn = A_mn / (2**n * math.factorial(n))
B_mn = B_mn / (2**n * math.factorial(n))

Both declarations share same denominator. Try to calculate first the factorial, save it into a variable and use it as denominator.

Also, np.exp() function is called multiple times, all with the same arguments:

np.exp(-1j*a)/(a)
np.exp(1j * m * matrix_b)

Numpy exponential function can be a little slow in execution time.

The same thing happens with all np.sin / np.cos(matrix_a) calls.

These changes won't provide you a huge improvement, but it's something:

def function_call_2(npp,nmax):
    matrix_a = np.random.rand(npp)
    matrix_b = np.random.rand(npp)
    a=np.random.rand()
    F = np.zeros((2*npp, 2*nmax*(nmax+2)),dtype=np.complex_) 
    npa=np.arange(npp)

    # Auxiliary variables
    sin_matrix_a = np.sin(matrix_a)
    cos_matrix_a = np.cos(matrix_a)

    for n in range(1,nmax+1,1):

        # Auxilary variables
        denominator = (2**n * math.factorial(n))

        a_n = np.sqrt(1 / (2 * np.pi * n * (n + 1)))
        for m in range(-n,n+1,1):        
            b_m = (-1)**((np.abs(m) + m) / 2)
            p_mn = int(1 / 2 * (np.abs(m) + n + 1 / 2 * (1 - (-1)**(np.abs(m) + n))))
            alpha_mn = np.sqrt(((2 * n + 1) / 2) * math.factorial(n - np.abs(m)) / math.factorial(n + np.abs(m)))
            A_mn = np.zeros(npp)
            B_mn = np.zeros(npp)
            for p in range(p_mn,n+1,1):
                Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p)/math.factorial(2 * p - np.abs(m) - n)
                A_mn = A_mn + Cai_pmn * (cos_matrix_a)**(2 * p - np.abs(m) - n)
                B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (cos_matrix_a)**(np.abs(2 * p - np.abs(m) - n - 1))
            A_mn = A_mn / denominator
            B_mn = B_mn / denominator
            S_mn = alpha_mn * m * A_mn * sin_matrix_a**np.abs(np.abs(m) - 1)
            D_mn = alpha_mn * (np.abs(m) * A_mn * cos_matrix_a * (sin_matrix_a)**(np.abs(np.abs(m) - 1)) - B_mn * (sin_matrix_a)**(np.abs(m) + 1))

            # Auxilary variables
            np_exp_1 = np.exp(-1j*a)/(a)

            np_exp_2 = np.exp(1j * m * matrix_b)

            h1 = 1j**(n+1)*np_exp_1
            h2 = 1j**(n)*np_exp_1

            F_s1_theta = 1j * a_n * b_m * h1 * (S_mn * np_exp_2)
            F_s1_phi =       -a_n * b_m * h1 * (D_mn * np_exp_2)
            F_s2_theta =      a_n * b_m * h2 * (D_mn * np_exp_2)
            F_s2_phi =   1j * a_n * b_m * h2 * (S_mn * np_exp_2)
            j = 2 * (n * (n + 1) + m - 1)
            F[2 * npa, j] = F_s1_theta[npa]
            F[2 * npa+1  , j] = F_s1_phi[npa]
            j = 2 * (n * (n + 1) + m - 1) + 1
            F[2 * npa, j] = F_s2_theta[npa]
            F[2 * npa+1, j] = F_s2_phi[npa]

I made a simple test, and I get a ~3 seconds faster execution:

prev_time_ep =time.time()
npp=500
nmax=80 
function_call(npp,nmax)           
print("--- %s seconds ---" % (time.time() - prev_time_ep))

prev_time_ep =time.time()
function_call_2(npp,nmax)           
print("--- %s seconds ---" % (time.time() - prev_time_ep))

Output:

--- 16.99950885772705 seconds ---
--- 14.231853580474854 seconds ---

Upvotes: 0

Methods of improving the efficiency of a a python-loop computation

Answers (3)

Related Questions