Reputation: 4089
I have two reasonably large arrays of lengths N and M respectively. For each of the N elements I need to do a calculation with each of the M elements, and then reduce these results in order to arrive at another array of length N. This sounds like the type of problem that is perfectly suited to GPU acceleration, so I would like to implement it using Numba CUDA, but I am struggling to work out how to deal with the reduction part of the problem. The Numba documentation on reduction (https://numba.pydata.org/numba-doc/dev/cuda/reduction.html) only shows how everything can be reduced into a single number, but I essentially need to reduce into an array. Below is a super simplified example of basically what I want to achieve:
from numba import cuda
import numpy as np
import math

@cuda.jit
def processArr(A, B):
    i = cuda.grid(1)
    if i < A.size:
        A[i] = A[i] * B

@cuda.jit
def reduceArr(A, B, C):
    i = cuda.grid(1)
    if i < A.size:
        total = 1
        processArr(A[i], B[i])
        for j in range(A.shape[1]):
            total *= A[i, j]
        C[i] = total

a = np.array([[0, 0], [1, 1], [1, 2]])
b = np.array([1, 2, 3])
c = np.zeros(3)

threadsperblock = 32
blockspergrid = math.ceil(b.shape[0] / threadsperblock)

reduceArr[blockspergrid, threadsperblock](a, b, c)
print(c)
This code obviously does not run, but hopefully it illustrates what I want to achieve.
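For reference, a plain NumPy version of the computation I have in mind would look roughly like this (scale each row of a by the matching element of b, then take the product of the row):

import numpy as np

a = np.array([[0, 0], [1, 1], [1, 2]])
b = np.array([1, 2, 3])
c = np.prod(a * b[:, None], axis=1)
print(c)  # [ 0  4 18]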
Is there a way to achieve this with Numba, or is it silly to try to do the reduction step on the GPU in the first place? In other words, is it best to just do the NxM operations on the GPU and then reduce those results on the CPU afterwards?
Upvotes: 1
Views: 1221
Reputation: 151972
Is there a way to achieve this with Numba
Yes. What you want to do is a segmented (or vectorized) transform-reduce operation. The thing that you are calling processArr is effectively your transform operation.
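To pin down the terminology, here is a serial (CPU) sketch of a segmented transform-reduce; transform_reduce_serial is just a name I'm using for illustration, not part of any library:

import numpy as np

def transform_reduce_serial(A, B, transform_op, reduce_op, init_val):
    # one independent transform-reduce per row / per output element
    C = np.empty(A.shape[0])
    for row in range(A.shape[0]):
        acc = init_val
        for col in range(A.shape[1]):
            acc = reduce_op(acc, transform_op(A[row, col], B[row]))
        C[row] = acc
    return C

# e.g. the question's multiply-then-product version:
# transform_reduce_serial(a, b, lambda x, y: x * y, lambda x, y: x * y, 1.0)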
To perform multiple parallel segmented reduce operations, there are a number of methods, the "best" of which will depend on the specific sizes of N and M.
For your choice of N ~= 100, I will suggest the use of a CUDA block per reduction. We will perform a classical parallel reduction, but at the CUDA block level, and each block will be responsible for one result element. Therefore each block must process M elements, and we must launch a total number of blocks equal to N. For the block-level processing, we will implement a block-stride loop, which is conceptually similar to a grid-stride loop.
Here is an example, roughly based on what you have shown:
$ cat t47.py
from numba import cuda
import numpy as np

# must be power of 2, less than 1025
nTPB = 128

reduce_init_val = 0

@cuda.jit(device=True)
def reduce_op(x, y):
    return x + y

@cuda.jit(device=True)
def transform_op(x, y):
    return x * y

@cuda.jit
def transform_reduce(A, B, C):
    s = cuda.shared.array(nTPB, dtype=A.dtype)
    b = cuda.blockIdx.x
    t = cuda.threadIdx.x
    if (b < A.shape[0]):
        s[t] = reduce_init_val
        # block-stride loop
        for i in range(t, A.shape[1], nTPB):
            s[t] = reduce_op(s[t], transform_op(A[b, i], B[b]))
        # parallel sweep reduction
        l = nTPB // 2
        while (l > 0):
            cuda.syncthreads()
            if (t < l):
                s[t] = reduce_op(s[t], s[t + l])
            l = l // 2
        if (t == 0):
            C[b] = s[0]

a = np.array([[0, 0], [1, 1], [1, 2]], dtype=np.float64)
b = np.array([1, 2, 3], dtype=np.float64)
c = np.zeros(3, dtype=np.float64)

threadsperblock = nTPB
blockspergrid = a.shape[0]
transform_reduce[blockspergrid, threadsperblock](a, b, c)
print(c)
$ python t47.py
[0. 4. 9.]
$
I'm not suggesting the above code is defect-free or suitable for any particular purpose. My intent is to outline a possible method. Note that the above code should be able to handle more or less arbitrary N and M dimensions.
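If you want a quick sanity check on larger inputs, something along these lines should work (this is just a suggested test, reusing transform_reduce and nTPB from the script above; the NumPy reference reflects the particular sum/product choice of reduce_op and transform_op):

import numpy as np

N, M = 100, 10000                  # arbitrary test sizes
a = np.random.rand(N, M)
b = np.random.rand(N)
c = np.zeros(N)

# one block per output element, nTPB threads per block
transform_reduce[N, nTPB](a, b, c)

# NumPy reference: scale each row of a by the matching element of b, then sum the row
c_ref = np.sum(a * b[:, None], axis=1)
print(np.allclose(c, c_ref))

If you want the product reduction from your question instead, change reduce_op to return x * y, set reduce_init_val = 1, and adjust the NumPy reference to use np.prod.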
Upvotes: 5