Reputation: 340
I have a function that operates on a 2D matrix on float64(x,y). Basic concept: for each combination of rows (no. rows choose 2) count the number of positiv values after subtraction (row1 - row2). In a 2Dmatrix of int64(y,y) store this value in index [row1,row2] if value is above a certain threshold and [row2,row1] if below.
I've implemented that and decorated it with @njit(parallel=False), that works fine @njit(parallel=True) seems to give no speedup. Trying to speed up the whole thing I had a look at @guvectorize, that works as well. However I'm not able to figure out how to use @guvectorize with parallel true in this case either.
I had a look at numba guvectorize target='parallel' slower than target='cpu' , where the solution was to use @vecorize instead, but I can not transfer the solution to my problem, therefore I am now seeking help :)
Basic jitted and guvectorized implementation
import numpy as np
from numba import jit, guvectorize, prange
import timeit
@jit(parallel=False)
def check_pairs_sg(raw_data):
# 2D array to be filled
result = np.full((len(raw_data), len(raw_data)), -1)
# Iterate over all possible gene combinations
for r1 in range(0, len(raw_data)):
for r2 in range(r1+1, len(raw_data)):
diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
return result
@jit(parallel=True)
def check_pairs_multi(raw_data):
# 2D array to be filled
result = np.full((len(raw_data), len(raw_data)), -1)
# Iterate over all possible gene combinations
for r1 in range(0, len(raw_data)):
for r2 in prange(r1+1, len(raw_data)):
diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
return result
@guvectorize(["void(float64[:,:], int64[:,:])"],
"(n,m)->(m,m)", target='cpu')
def check_pairs_guvec_sg(raw_data, result):
for r1 in range(0, len(result)):
for r2 in range(r1+1, len(result)):
diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
@guvectorize(["void(float64[:,:], int64[:,:])"],
"(n,m)->(m,m)", target='parallel')
def check_pairs_guvec_multi(raw_data, result):
for r1 in range(0, len(result)):
for r2 in range(r1+1, len(result)):
diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
if __name__=="__main__":
np.random.seed(404)
a = np.random.random((512,512)).astype(np.float64)
res = np.full((len(a), len(a)), -1)
and measured with
%timeit check_pairs_sg(a)
%timeit check_pairs_multi(a)
%timeit check_pairs_guvec_sg(a, res)
%timeit check_pairs_guvec_multi(a, res)
resulting in:
614 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
507 ms ± 6.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
622 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
671 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I cat wrap my head around on how to implement this as @vectorized or a proper parallel @guvectorize to fill the resulting 2D array truely in parallel.
I guess this is my first step before trying to taking this further to gpu.
Any help is highly appreciated.
Upvotes: 4
Views: 5290
Reputation: 6482
For example think of a more or less exact equivalent implementation of the lines
diff = np.subtract(raw_data[:, r1], raw_data[:, r2])
num_pos = len(np.where(diff > 0)[0])
in C++.
Pseudo Code
The main problems in your code are:
Removing temporary arrays and simplification
@nb.njit(parallel=False)
def check_pairs_simp(raw_data):
# 2D array to be filled
result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
# Iterate over all possible gene combinations
for r1 in range(0, raw_data.shape[1]):
for r2 in range(r1+1, raw_data.shape[1]):
num_pos=0
for i in range(raw_data.shape[0]):
if (raw_data[i,r1]>raw_data[i,r2]):
num_pos+=1
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
return result
Removing temporary arrays and simplification + contigous memory access
@nb.njit(parallel=False)
def check_pairs_simp_rev(raw_data_in):
#Create a transposed array not just a view
raw_data=np.ascontiguousarray(raw_data_in.T)
# 2D array to be filled
result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
# Iterate over all possible gene combinations
for r1 in range(0, raw_data.shape[0]):
for r2 in range(r1+1, raw_data.shape[0]):
num_pos=0
for i in range(raw_data.shape[1]):
if (raw_data[r1,i]>raw_data[r2,i]):
num_pos+=1
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
return result
Removing temporary arrays and simplification + contigous memory access + Parallelization
@nb.njit(parallel=True,fastmath=True)
def check_pairs_simp_rev_p(raw_data_in):
#Create a transposed array not just a view
raw_data=np.ascontiguousarray(raw_data_in.T)
# 2D array to be filled
result = np.full((raw_data.shape[0],raw_data.shape[1]), -1)
# Iterate over all possible gene combinations
for r1 in nb.prange(0, raw_data.shape[0]):
for r2 in range(r1+1, raw_data.shape[0]):
num_pos=0
for i in range(raw_data.shape[1]):
if (raw_data[r1,i]>raw_data[r2,i]):
num_pos+=1
# Arbitrary check to illustrate
if num_pos >= 5:
result[r1,r2] = num_pos
else:
result[r2,r1] = num_pos
return result
Timings
%timeit check_pairs_sg(a)
488 ms ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit check_pairs_simp(a)
186 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit check_pairs_simp_rev(a)
12.1 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit check_pairs_simp_rev_p(a)
5.43 ms ± 49.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 7