Reputation: 232
I have a function in Numba with several loops that could be parallelized. The loop writes to a common array, K, so I understand that the compiler might not be optimizing as much as it could. However, I have no feel for what will make numba's jit compiler create the most efficient code. The examples in the documentation are too simplistic to be of help.
I have tried changing each of the range
into prange
. The best results were obtained when parallelizing the loop over k_x
, but I only got a 2.8x improvement on a machine with 4 cores. I know that I should not expect linear performance improvements, but I have the feeling that I should get better results in this case. For example, I get slightly better results using dask, x_cond.map_blocks(cond_expect_kernel, x_tr, *args)
which is odd considering the scheduler overhead.
Is there a way to improve the parallelization of this function beyond simply changing range
into prange
?
Original function
@jit(float64[:,:](float64[:,:], float64[:,:], int64, int64), nopython=True, nogil=True)
def cond_expect_kernel(x_cond, x_tr, degree, amount_non_cond_vars):
size = x_cond.shape[1]
x_tr_cond = x_tr[:, :size]
samples_x = x_cond.shape[0]
samples_tr = x_tr.shape[0]
K = (1+np.dot(x_cond, x_tr_cond.T))**degree
for j in range(size, size+amount_non_cond_vars):
for k_x in range(samples_x):
for k_x_tr in range(samples_tr):
K[k_x, k_x_tr] += x_tr[k_x_tr, j]**2*3
for j_left in range(size):
K[k_x, k_x_tr] += x_cond[k_x, j_left]*x_tr[k_x_tr, j_left]*x_tr[k_x_tr, j] ** 2 *3
return K
Best parallel version so far:
@jit(float64[:,:](float64[:,:], float64[:,:], int64, int64), nopython=True, parallel=True)
def cond_expect_kernel_parallel(x_cond, x_tr, degree, amount_non_cond_vars):
size = x_cond.shape[1]
x_tr_cond = x_tr[:, :size]
samples_x = x_cond.shape[0]
samples_tr = x_tr.shape[0]
K = (1+np.dot(x_cond, x_tr_cond.T))**degree
for j in range(size, size+amount_non_cond_vars):
for k_x in prange(samples_x):
for k_x_tr in range(samples_tr):
K[k_x, k_x_tr] += x_tr[k_x_tr, j]**2*3
for j_left in range(size):
K[k_x, k_x_tr] += x_cond[k_x, j_left]*x_tr[k_x_tr, j_left]*x_tr[k_x_tr, j] ** 2 *3
return K
For reference, I am working on a machine with 4 cores and another with 16 cores. samples_x
is around 100,000, samples_tr
around 50000, size
around 3, and amount_non_cond_vars
around 100.
Thanks!
Upvotes: 1
Views: 626
Reputation: 6482
Your code has a few performance critical issues.
nb.float64[:,::1]
. This often prevents SIMD- vectorization and can in many cases lead to lower performance. If you are not sure if the arrays are C-contigious simply don't the declare it, Numba can do this itself.size
is quite often 3, you can write a specialised version for this problem. (In this case manual loop unrolling). You can check with a small wrapper function if the specialised case occurs.Examples
import numpy as np
import time
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
@nb.njit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:], nb.int64, nb.int64),fastmath=True,parallel=True)
def cond_expect_kernel_gen(x_cond, x_tr, degree, amount_non_cond_vars):
x_tr_cond = x_tr[:,:x_cond.shape[1]]
K = np.dot(x_cond, x_tr_cond.T)
for k_x in nb.prange(x_cond.shape[0]):
for k_x_tr in range(x_tr.shape[0]):
sum=(K[k_x, k_x_tr]+1)**degree
for j in range(x_cond.shape[1], x_cond.shape[1]+amount_non_cond_vars):
sum += x_tr[k_x_tr, j]**2*3
for j_left in range(x_cond.shape[1]):
sum += x_cond[k_x, j_left]*x_tr[k_x_tr, j_left]*x_tr[k_x_tr, j] ** 2 *3
K[k_x, k_x_tr]=sum
return K
@nb.njit(nb.float64[:,::1](nb.float64[:,::1], nb.float64[:,::1], nb.int64, nb.int64),fastmath=True,parallel=True)
def cond_expect_kernel_3(x_cond, x_tr, degree, amount_non_cond_vars):
assert x_cond.shape[1]==3
x_tr_cond = x_tr[:,:x_cond.shape[1]]
K = np.dot(x_cond, x_tr_cond.T)
for k_x in nb.prange(x_cond.shape[0]):
for k_x_tr in range(x_tr.shape[0]):
sum=(K[k_x, k_x_tr]+1)**degree
for j in range(x_cond.shape[1], x_cond.shape[1]+amount_non_cond_vars):
sum += x_tr[k_x_tr, j]**2*3
sum_2=0.
sum_2 += x_cond[k_x, 0]*x_tr[k_x_tr, 0]
sum_2 += x_cond[k_x, 1]*x_tr[k_x_tr, 1]
sum_2 += x_cond[k_x, 2]*x_tr[k_x_tr, 2]
sum+=sum_2*x_tr[k_x_tr, j] ** 2 *3
K[k_x, k_x_tr]=sum
return K
Performance
x_cond=np.random.rand(10_000,3)
x_tr=np.random.rand(5_000,103)
amount_non_cond_vars=100
degree=3
t1=time.time()
res_1=cond_expect_kernel_gen(x_cond, x_tr, degree, amount_non_cond_vars)
print(time.time()-t1)
t1=time.time()
res_2=cond_expect_kernel_3(x_cond, x_tr, degree, amount_non_cond_vars)
print(time.time()-t1)
(Quadcore i7, Numba 0.40dev)
your version, single threaded: 40s
your version, parallel: 8.61s
mod_general:3.8s
mod_3: 1.35s
Upvotes: 1