Reputation: 531
I wrote some test code in Python using numba:
from numba import cuda
import numpy as np
import numba

@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    a = cuda.local.array(shape=1, dtype=numba.float64)
    if i_p < size and i_k1 < size and i_k2 < size:
        a1 = i_p
        a2 = i_k1 + 1
        a3 = i_k2 + 2
        a[0] = a1
        a[1] = a2
        a[2] = a3
        for i in range(len(random_array)):
            output[i_p, i_k1, i_k2, i] = a[int(random_array[i])]
output = cuda.device_array((2, 2, 2, 5))
random_array = np.array([np.random.random() * 3 for i in range(5)])
print(random_array)
random_array0 = cuda.to_device(random_array)

size = 2
threadsperblock = (8, 8, 8)
blockspergridx = (size + (threadsperblock[0] - 1)) // threadsperblock[0]
blockspergrid = (blockspergridx, blockspergridx, blockspergridx)

# Start the kernel
function[blockspergrid, threadsperblock](output, size, random_array0)
print(output.copy_to_host())
# Test that it is consistent with the non-GPU case
output = np.zeros([2, 2, 2, 5])
for i in range(size):
    for j in range(size):
        for k in range(size):
            a = [i, j + 1, k + 2]
            for ii in range(len(random_array)):
                output[i, j, k, ii] = a[int(random_array[ii])]
print(output)
I am confused about the usage of cuda.local.array. It takes two arguments: shape and dtype. However, the result does not change with different values of shape, for example shape=0, shape=1, or shape=100. I don't understand what the shape argument does. Can anyone explain it?
Upvotes: 1
Views: 4955
Reputation: 72348
Quoting directly from the documentation:
Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
numba.cuda.local.array(shape, type)
Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).
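As a minimal illustration of that API (a sketch, assuming a standard numba CUDA setup; the kernel and array names here are made up for the example), a local array is allocated inside a kernel with a constant shape and then read and written like any other array:

from numba import cuda
import numba

@cuda.jit
def example_kernel(out):
    # shape must be a simple constant: an int or a tuple of ints
    buf = cuda.local.array(shape=(2, 3), dtype=numba.float64)
    i = cuda.grid(1)
    if i < out.shape[0]:
        buf[0, 0] = float(i)   # written through indexing
        out[i] = buf[0, 0]     # read back the same way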
So in this case, where you want local memory with at least three elements, you must have shape >= 3 for your code to work correctly.
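With your code, that means changing only the allocation, e.g.:

a = cuda.local.array(shape=3, dtype=numba.float64)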
The fact that your code appears to work with shape=1 should be regarded as undefined behaviour. If I run your code using cuda-memcheck, I get this:
$ cuda-memcheck python indexing.py
========= CUDA-MEMCHECK
[ 1.99261914 1.91166157 2.85454532 1.64078385 1.9576766 ]
========= Invalid __local__ write of size 8
========= at 0x000001b0 in cudapy::__main__::function$241(Array<double, int=4, A, mutable, aligned>, __int64, Array<double, int=1, A, mutable, aligned>)
========= by thread (1,1,1) in block (0,0,0)
========= Address 0x00fffc80 is out of bounds
[SNIPPED for brevity]
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x23c06d]
Traceback (most recent call last):
File "indexing.py", line 42, in <module>
outputd = output.copy_to_host()
File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py", line 198, in copy_to_host
_driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 1481, in device_to_host
fn(host_pointer(dst), device_pointer(src), size, *varargs)
File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 259, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/opt/miniconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 296, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
========= ERROR SUMMARY: 9 errors
i.e. running with an incorrect local array size produces memory access errors, as you would expect, although the code still actually runs. On the other hand, if I modify your code to use shape=3:
$ cuda-memcheck python indexing.py
========= CUDA-MEMCHECK
[ 1.98532356 1.53822652 0.69376061 2.22448278 0.76800584]
True
========= ERROR SUMMARY: 0 errors
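The True in that output comes from comparing the GPU result against the host loop's result. A minimal sketch of that check, assuming the two results are kept in separate variables (gpu_output and cpu_output are made-up names here, since the original code reuses output for both):

print(np.allclose(gpu_output.copy_to_host(), cpu_output))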
The memory access errors disappear. So you should not confuse working correctly with undefined behaviour (which can include accidentally appearing to work while throwing errors, as in this case). The exact reasons why this happens are buried in the numba runtime and the code its compiler produces; I have no interest in looking at that in detail to explain further.
Upvotes: 2