velenos14

Reputation: 566

parallel numba code on 20 cores slower than serial

The function I want to profile, to see whether it improves after adding the parallel=True keyword (potentially together with nogil=True), is quite simple and is shown below the profiling harness. It assumes a preamble has been loaded: some function definitions, some global variables, and two arrays read from disk (two .npy files). I run the script on 20 cores as NUMBA_NUM_THREADS=20 python main.py and profile it with:

if __name__ == "__main__":
    from timeit import Timer
    t = Timer(lambda: get_Cilmoft(Psi))    
    print( min( t.repeat(repeat=2, number=2) ) )

And the function of interest inside main.py is:

@njit(parallel=True, fastmath=True)
def get_Cilmoft(Psi):

    Cilmoft = np.zeros( (gl.N_rs - 1,  gl.N_thetas,  2*gl.N_thetas-1,  2), dtype=np.complex128)
    for i in prange(Cilmoft.shape[0]):
        for l in prange(Cilmoft.shape[1]):
            for m in prange(-l, l+1, 1):

                exp_of_phis = np.exp(-1j * m * gl.phis)  # a temporary array? allocation contention problem?

                for contor in prange(Psi.shape[2]):
                    Psi[:, :, contor] = Psi[:, :, contor] * exp_of_phis[contor]

                F1 = np.zeros( (Psi.shape[0], Psi.shape[1]), dtype=np.complex128) # a temporary array? allocation contention problem?
                for a in prange(Psi.shape[0]):
                    for b in prange(Psi.shape[1]):
                        F1[a, b] = np.trapz(Psi[a, b, :])

                poly_results = eval_legendre_withNJIT(l)
                F2 = np.zeros( (gl.N_rs - 1, ), dtype=np.complex128) # a temporary array? allocation contention problem?

                for idx in prange(gl.N_rs - 1):
                    F2[idx] = np.sum(weights_here * poly_results * F1[idx, :])

                Cilmoft[i, l, m+l, 0] = np.complex128(np.sum(eigenvectors_memory[:, i, l] * F2 * lobatto_weights))

    return Cilmoft

The preamble for a MWE is:

import numpy as np
import numba
from numba import njit, vectorize, prange
from scipy.special import roots_legendre, legendre, eval_legendre, lpmv


import gl

Psi = np.empty((80, 20, 18), dtype=np.complex128)

eigenenergies_memory = np.load("eigenValues_N_rs80_alpha25.0_rmax200.0.npy") # shall be 2D array shape (gl.N_rs - 1,   gl.N_thetas)
eigenvectors_memory = np.load("eigenVectors_N_rs80_alpha25.0_rmax200.0.npy") # shall be 3D array shape  (gl.N_rs - 1,   gl.N_rs - 1,   gl.N_thetas), first index is the r-dep, second index is the i-index, third index is the l-index
def gLLNodesAndWeights (n, epsilon = 1e-15):
    x = np.empty (n)
    w = np.empty (n)
    x[0] = -1
    x[n - 1] = 1
    w[0] = 2.0 / ((n * (n - 1)))
    w[n - 1] = w[0]
    n_2 = n // 2
    for i in range (1, n_2):
        xi = (1 - (3 * (n - 2)) / (8 * (n - 1) ** 3)) *\
           np.cos ((4 * i + 1) * np.pi / (4 * (n - 1) + 1))
        error = 1.0
        while error > epsilon:
            y  =  dLgP (n - 1, xi)
            y1 = d2LgP (n - 1, xi)
            y2 = d3LgP (n - 1, xi)
            dx = 2 * y * y1 / (2 * y1 ** 2 - y * y2)
            xi -= dx
            error = abs (dx)
        x[i] = -xi
        x[n - i - 1] =  xi
        w[i] = 2 / (n * (n - 1) * lgP (n - 1, x[i]) ** 2)
        w[n - i - 1] = w[i]
    if n % 2 != 0:
        x[n_2] = 0
        w[n_2] = 2.0 / ((n * (n - 1)) * lgP (n - 1, np.array (x[n_2])) ** 2)
    return x, w
def dLgP (n, xi):
  return n * (lgP (n - 1, xi) - xi * lgP (n, xi))\
           / (1 - xi ** 2)
def d2LgP (n, xi):
  return (2 * xi * dLgP (n, xi) - n * (n + 1)\
                                    * lgP (n, xi)) / (1 - xi ** 2)
def d3LgP (n, xi):
  return (4 * xi * d2LgP (n, xi)\
                 - (n * (n + 1) - 2) * dLgP (n, xi)) / (1 - xi ** 2)
def lgP (n, xi):
  if n == 0:
    return np.ones (xi.size)
  elif n == 1:
    return xi
  else:
    fP = np.ones (xi.size); sP = xi.copy (); nP = np.empty (xi.size)
    for i in range (2, n + 1):

      nP = ((2 * i - 1) * xi * sP - (i - 1) * fP) / i
      fP = sP; sP = nP
    return nP
a = gLLNodesAndWeights(gl.N_rs + 1)[0]
roots = a[1:-1]
lobatto_nodes = roots
b =  roots_legendre(gl.N_thetas)
roots_here, weights_here = np.float64(b[0]), np.float64(b[1])

legendre_results_for_Psi_lm = np.zeros((roots.shape[0], ), dtype=np.complex128)
for j in range(roots.shape[0]):
    legendre_results_for_Psi_lm[j] = eval_legendre(gl.N_rs, roots[j])

@njit
def eval_legendre_for_Psi_lm(j):
    return legendre_results_for_Psi_lm[j]

legendre_results = np.zeros( (gl.N_thetas, roots_here.shape[0]) , dtype=np.complex128)
for l in range(gl.N_thetas):
    legendre_results[l, :] = eval_legendre(l, roots_here)
@njit
def eval_legendre_withNJIT(l):
    return legendre_results[l, :]

@njit
def r(x):
    return gl.L * (1 + x) / (1 - x + gl.alpha)
@njit
def r_prime(x):
    return gl.L * (2 + gl.alpha) / (1 - x + gl.alpha)**2

lobatto_weights = 2 / ( (gl.N_rs) * (gl.N_rs+1) * eval_legendre(gl.N_rs, lobatto_nodes)**2 )

The gl.py is:

import numpy as np

N_rs = np.int32(80)
N_thetas = 20
N_phis = 18
phis = np.linspace(0.0, 2*np.pi, num = N_phis)

delta_theta = np.float64(np.pi / N_thetas)
delta_phi = np.float64(2*np.pi / N_phis)

N_timesteps = int(500)  # np.int is deprecated/removed in recent NumPy versions

r_max = np.float64(200.0)
alpha = np.float64(25)
L = alpha * r_max / 2

Temp_Env = "sin_squared"
N = np.int32(12)

delta_t = 0.1
omega = 0.057 # equivalent to lambda = 800 nm

c_au = 137.036

The eigenenergies_memory and eigenvectors_memory arrays are obtained from the script below (just run it once and it will save them to disk):

import numpy as np
import scipy
from matplotlib import pyplot as plt

from scipy.special import roots_legendre, legendre
from numpy.linalg import eig

import gl

def gLLNodesAndWeights (n, epsilon = 1e-15):

    x = np.empty (n)
    w = np.empty (n)
    
    x[0] = -1 
    x[n - 1] = 1
    w[0] = 2.0 / ((n * (n - 1))) 
    w[n - 1] = w[0]
    
    n_2 = n // 2
    
    for i in range (1, n_2):
        xi = (1 - (3 * (n - 2)) / (8 * (n - 1) ** 3)) *\
           np.cos ((4 * i + 1) * np.pi / (4 * (n - 1) + 1))
      
        error = 1.0
      
        while error > epsilon:
            y  =  dLgP (n - 1, xi)
            y1 = d2LgP (n - 1, xi)
            y2 = d3LgP (n - 1, xi)
        
            dx = 2 * y * y1 / (2 * y1 ** 2 - y * y2)
        
            xi -= dx
            error = abs (dx)
      
        x[i] = -xi
        x[n - i - 1] =  xi
      
        w[i] = 2 / (n * (n - 1) * lgP (n - 1, x[i]) ** 2)
        w[n - i - 1] = w[i]

    if n % 2 != 0:
        x[n_2] = 0
        w[n_2] = 2.0 / ((n * (n - 1)) * lgP (n - 1, np.array (x[n_2])) ** 2)

    return x, w

def dLgP (n, xi):

  return n * (lgP (n - 1, xi) - xi * lgP (n, xi))\
           / (1 - xi ** 2)

def d2LgP (n, xi):

  return (2 * xi * dLgP (n, xi) - n * (n + 1)\
                                    * lgP (n, xi)) / (1 - xi ** 2)

def d3LgP (n, xi):

  return (4 * xi * d2LgP (n, xi)\
                 - (n * (n + 1) - 2) * dLgP (n, xi)) / (1 - xi ** 2)

def lgP (n, xi):

  if n == 0:
    return np.ones (xi.size)
  
  elif n == 1:
    return xi

  else:
    fP = np.ones (xi.size); sP = xi.copy (); nP = np.empty (xi.size)
    for i in range (2, n + 1):

      nP = ((2 * i - 1) * xi * sP - (i - 1) * fP) / i
      fP = sP; sP = nP

    return nP


a = gLLNodesAndWeights(gl.N_rs + 1)[0]
roots = a[1:-1] # has shape (gl.N_rs - 1) if input to gLLNodesAndWeights is equal to (gl.N_rs + 1)




# Non-Linear Mapping 
def r(x):
    return gl.L * (1 + x) / (1 - x + gl.alpha)

def r_prime(x):
    return gl.L * (2 + gl.alpha) / (1 - x + gl.alpha)**2


# H_{l}^{0}(x) assembly functions
def V_l(x, l, Z):
    return ( l*(l+1) / (2*r(x)**2) ) + Coulomb(r(x), Z)

def Coulomb(r, Z):
    return (-Z / r)


Z = 1
H = np.zeros(  (gl.N_rs - 1,  gl.N_rs - 1,  gl.N_thetas),   dtype=np.complex128)
pref = gl.N_rs * (gl.N_rs + 1) / 6
diag_values_Kinetic = np.array([  (1/r_prime(x)) * pref * (1/(1-x**2)) * (1/r_prime(x)) for x in roots])

for l in range(gl.N_thetas):
    diag_values_Coulomb_and_Laser = np.array([  V_l(x, l, Z) for x in roots ])
    np.fill_diagonal( H[:, :, l],   diag_values_Kinetic + diag_values_Coulomb_and_Laser )
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            if (i != j):
                H[i, j, l] = (1/r_prime(roots[i])) * (1/r_prime(roots[j])) * (1/(roots[i]-roots[j])**2) 
            # else already filled by np.fill_diagonal() from above

eig_values = np.zeros( (gl.N_rs - 1,  gl.N_thetas), dtype=np.complex128)
eig_vectors = np.zeros( (gl.N_rs - 1,  gl.N_rs - 1,  gl.N_thetas), dtype = np.complex128)

for l in range(gl.N_thetas):
    eig_values[:, l], eig_vectors[:, :, l] = eig( H[:, :, l] )

np.save("eigenVectors_N_rs{}_alpha{}_rmax{}.npy".format(gl.N_rs, gl.alpha, gl.r_max), eig_vectors)
np.save("eigenValues_N_rs{}_alpha{}_rmax{}.npy".format(gl.N_rs, gl.alpha, gl.r_max), eig_values)

What I have tried in order to optimize get_Cilmoft(Psi):

  1. I read on Stack Overflow that allocating small arrays inside a prange for-loop causes contention on the standard allocator and should be avoided, so I moved the allocations of F1 and F2 to immediately below the allocation of Cilmoft. Hoisting those allocations out of the nested for-loops did not improve the time to solution of this function on 20 cores.
  2. I have played with the prange keyword, keeping prange only on the outermost for-loop and using plain range for all the others, i.e. for i in prange(Cilmoft.shape[0]): followed by ordinary range loops. No significant improvement.
  3. I have tried adding nogil=True; the change in execution time was very minor.
  4. I have tried removing parallel=True, and the time to solution improved massively when running the .py the same way as before. So it is clear that either I am doing something wrong, or I simply cannot make this piece of code run better in parallel.
  5. Might it be that @njit "does not like" np.sum() and np.trapz()? I tried to replace the two np.sum() calls by:
for idx in prange(80 - 1):
    # F2[idx] = np.sum(weights_here * poly_results * F1[idx, :])
    int_res = weights_here * poly_results * F1[idx, :]
    for n in range(int_res.shape[0]):
        F2[idx] += int_res[n]

and by:

intermediate_res  = eigenvectors_memory[:, i, l] * F2 * lobatto_weights
for k in range(intermediate_res.shape[0]):
    Cilmoft[i, l, m+l, 0] += np.complex128(intermediate_res[k])
  6. I made the number of iterations of the for-loops constants instead of Psi.shape[0], Psi.shape[1] or similar expressions. The loops now look like for i in range(80 - 1): ..., i.e. manually written ints.

  7. I no longer allocate the temporary arrays exp_of_phis and poly_results inside the triple nested (i, l, m) for-loop, but pre-compute them outside the function as:

exp_of_phis_matrix = np.zeros( (19*2+1, 18), dtype=np.complex128)
for m in range(-19, 20, 1):
    exp_of_phis_matrix[m+19, :] = np.exp(-1j * m * gl.phis)
poly_results_vector = eval_legendre_withNJIT(np.array([[l for l in range(20)]]).reshape(20, ))

Then use them as:

for contor in range(18):
    Psi[:, :, contor] = Psi[:, :, contor] * exp_of_phis_matrix[m+19, contor]

and as:

for idx in prange(80 - 1):
    F2[idx] = np.sum(weights_here * poly_results_vector[l] * F1[idx, :])

This gave the first visible improvement of everything I tried. Still, running with parallel=True is slower than running without it by at least one order of magnitude. A sketch combining the changes above is shown below.
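
For concreteness, here is a minimal sketch of the combined restructuring (an illustration, not my final code): only the outermost loop is a prange, the phi-rotation is applied on the fly without overwriting the input Psi, the reductions are written as explicit loops, and the precomputed exp_of_phis_matrix from point 7 and the legendre_results, weights_here, eigenvectors_memory and lobatto_weights globals from the preamble are reused. Unlike the original function, it always multiplies the unmodified Psi by the phase factors, which I believe is the intended behaviour:

@njit(parallel=True, fastmath=True)
def get_Cilmoft_outer_prange(Psi):
    # parallelize only over i; each i-iteration writes only to its own
    # temporaries and to the disjoint slice Cilmoft[i, ...]
    Cilmoft = np.zeros( (gl.N_rs - 1,  gl.N_thetas,  2*gl.N_thetas-1,  2), dtype=np.complex128)
    for i in prange(Cilmoft.shape[0]):
        F1 = np.zeros( (Psi.shape[0], Psi.shape[1]), dtype=np.complex128)
        F2 = np.zeros( (gl.N_rs - 1, ), dtype=np.complex128)
        for l in range(Cilmoft.shape[1]):
            for m in range(-l, l+1):
                # phi-integration (trapezoid rule, unit spacing) of the rotated Psi,
                # reading Psi without modifying it
                for a in range(Psi.shape[0]):
                    for b in range(Psi.shape[1]):
                        acc = 0.0 + 0.0j
                        for contor in range(Psi.shape[2] - 1):
                            y0 = Psi[a, b, contor]     * exp_of_phis_matrix[m + gl.N_thetas - 1, contor]
                            y1 = Psi[a, b, contor + 1] * exp_of_phis_matrix[m + gl.N_thetas - 1, contor + 1]
                            acc += 0.5 * (y0 + y1)
                        F1[a, b] = acc
                # theta-quadrature with the Gauss-Legendre weights
                for idx in range(gl.N_rs - 1):
                    s = 0.0 + 0.0j
                    for b in range(F1.shape[1]):
                        s += weights_here[b] * legendre_results[l, b] * F1[idx, b]
                    F2[idx] = s
                # projection onto the eigenvectors with the Lobatto weights
                c = 0.0 + 0.0j
                for k in range(F2.shape[0]):
                    c += eigenvectors_memory[k, i, l] * F2[k] * lobatto_weights[k]
                Cilmoft[i, l, m+l, 0] = c
    return Cilmoft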

Does anyone know what I could try or read as the next steps?

Edit after user_na's comment: I now profile like this:

if __name__ == "__main__":
    from timeit import Timer
    # call each function which will be used inside get_Cilmoft(Psi) once, to avoid timing their compilation

    eval_legendre_withNJIT(2)
    get_Cilmoft(Psi)

    t = Timer(lambda: get_Cilmoft(Psi))
    print( min( t.repeat(repeat=2, number=2) ) )

And I get the following. I varied gl.N_rs, i.e. Psi.shape[0], over {40, 80, 160, 320, 640} and ran both the parallel and the serial function on those Psi's. For the parallel runs I obtained better results when only the first for-loop kept the prange keyword. The first time is the parallel function with all its for-loops using prange, the second is the parallel function with only its first for-loop using prange, and the third is the serial function. An underscore means that I do not have those times yet (simulations are in progress).
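
The three variants are timed with the same harness for each loaded Psi. A minimal sketch, where get_Cilmoft_all_prange, get_Cilmoft_outer_prange and get_Cilmoft_serial are placeholder names for the three separately compiled copies of the function (each gl.N_rs value is a separate run, since the preamble reloads the .npy files):

from timeit import Timer

variants = {
    "parallel, all loops prange":  get_Cilmoft_all_prange,
    "parallel, outer loop prange": get_Cilmoft_outer_prange,
    "serial":                      get_Cilmoft_serial,
}

for name, func in variants.items():
    func(Psi)                                  # warm-up call so compilation is not timed
    t = Timer(lambda func=func: func(Psi))     # bind func explicitly inside the lambda
    print(name, min(t.repeat(repeat=2, number=2)))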

Update: I wrote my own versions of np.trapz and np.sum, @njit-ted them (with fastmath=True only) and used them wherever I was calling np.trapz and np.sum(). The times are now as follows (serial and parallel):
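
For reference, a minimal sketch of what those replacements look like; the names my_sum and my_trapz are mine, and my_trapz assumes unit spacing, matching how np.trapz(y) is called above without an x argument:

from numba import njit

@njit(fastmath=True)
def my_sum(arr):
    # plain accumulation loop over a 1D complex array
    s = 0.0 + 0.0j
    for k in range(arr.shape[0]):
        s += arr[k]
    return s

@njit(fastmath=True)
def my_trapz(y):
    # trapezoidal rule with unit spacing, like np.trapz(y) with no x or dx given
    s = 0.0 + 0.0j
    for k in range(y.shape[0] - 1):
        s += 0.5 * (y[k] + y[k + 1])
    return s

They are then called as F1[a, b] = my_trapz(Psi[a, b, :]) and F2[idx] = my_sum(weights_here * poly_results_vector[l] * F1[idx, :]).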

Thank you!

Upvotes: 1

Views: 390

Answers (1)

user_na

Reputation: 2273

My comment is getting lengthy, so I will write it as an answer instead.

Generally it is important to check whether there is a constant offset, i.e. something that scales like O(1), which makes your algorithm seem worse than it actually is.

The first thing is the constant compile time. To be 100% sure, you could explicitly call the function once before profiling, although this is probably already taken care of by your min().

get_Cilmoft(Psi)
t = Timer(lambda: get_Cilmoft(Psi))    
print( min( t.repeat(repeat=2, number=2) ) )

The second thing is the overhead the parallelized function has for setting up and organizing the threads. Repeating the measurement will not reduce this time, so use reasonably big data sets to keep this O(1) part small compared to the computing time. It is also advisable to check how the parallelized version scales with the size of Psi and compare that to the reference (original) version.

If your Psi is not big enough, it might be that you lose your performance to the overhead of the parallelized version.

Check the timings for different sizes of Psi: do the two versions scale similarly with the size of Psi? By "scaling similarly" I mean comparing the ratio time for Psi of size 2N / time for Psi of size N between the two implementations, with the same N and 2N for both.
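
For example, if you collect the measured runtimes in one dict per implementation, keyed by N = Psi.shape[0], the check is just (a sketch, to be filled with your own measurements):

def scaling_ratios(times):
    # times: dict mapping N (= Psi.shape[0]) to the measured runtime in seconds
    ns = sorted(times)
    return {(n1, n2): times[n2] / times[n1] for n1, n2 in zip(ns, ns[1:])}

# compare scaling_ratios(times_serial) with scaling_ratios(times_parallel),
# where each dict holds your own measurements for N = 40, 80, 160, ...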

If you share this info, we can dig deeper into the possible root cause of your observation.

[update]

The behavior is very strange indeed. It gets a factor of 2 faster, only to get slower again when you increase the data size once more. I am not sure I will be of big help, but here is one thing that is very odd in your code.

def get_Cilmoft(Psi):
    ...
    for i in prange(Cilmoft.shape[0]):
        for l in prange(Cilmoft.shape[1]):
            for m in prange(-l, l+1, 1):
                ...
                for contor in prange(Psi.shape[2]):
                    Psi[:, :, contor] = Psi[:, :, contor] * exp_of_phis[contor]

In each of the parallel loops you change the contents of your input variable. I don't know how the compiler deals with this, but it could very well be the problem. It might detect this as a dependency between the iterations of the loop and skip the parallelization here. Try using a temporary variable to save the result instead, as sketched below.
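
Something along these lines, i.e. a temporary array that is private to the iteration instead of writing back into Psi (only a sketch of the idea; the rest of the loop body stays as it is):

for i in prange(Cilmoft.shape[0]):
    for l in prange(Cilmoft.shape[1]):
        for m in prange(-l, l+1, 1):
            exp_of_phis = np.exp(-1j * m * gl.phis)
            # Psi is now only read, never written, so the iterations no longer
            # look data-dependent to the compiler
            Psi_rot = np.empty_like(Psi)
            for contor in prange(Psi.shape[2]):
                Psi_rot[:, :, contor] = Psi[:, :, contor] * exp_of_phis[contor]
            # ... compute F1 from Psi_rot instead of Psi; the rest stays as before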

I am also not sure how numba handles write access to shared variables. It could be that it creates semaphores, so all parallel threads queue up and wait to write their data back. This can happen if the individual calculations are too small. Try commenting out the writes to variables at different places in your code and check how this affects your calculation time.

Hope this helps at least a bit.

Upvotes: 2
