Reputation: 1411
I'm trying to implement a transform which, at some stage, uses a lookup table of less than 1 KB in size. This seems like it shouldn't pose a problem to a modern graphics card.
But the code below is failing with an unknown error:
from numba import cuda, vectorize
import numpy as np
tmp = np.random.uniform(0, 100, 1000000).astype(np.int16)
tmp_device = cuda.to_device(tmp)
lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)
@cuda.jit(device=True)
def lookup(x):
return lut[x]
@vectorize("float32(int16)", target="cuda")
def test_lookup(x):
return lookup(x)
test_lookup(tmp_device).copy_to_host() # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
@cuda.jit(device=True)
def lookup(x):
return x + lut[1]
Once this error occurs, I am essentially no longer able to use the CUDA context at all. For instance, allocating a new array via cuda.to_device results in:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0
Upvotes: 3
Views: 597
Reputation: 1411
The above code is fixed by modifying lookup to copy the table into constant memory inside the device function:
@cuda.jit(device=True)
def lookup(x):
lut_device = cuda.const.array_like(lut)
return lut_device[x]
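Putting it together, the complete working version looks like this (a minimal sketch assembled from the code above; the vectorize wrapper is unchanged from the question):
from numba import cuda, vectorize
import numpy as np

tmp = np.random.uniform(0, 100, 1000000).astype(np.int16)
lut = np.arange(100).astype(np.float32) * 2.5

@cuda.jit(device=True)
def lookup(x):
    # bake the host array into constant memory at compile time
    lut_device = cuda.const.array_like(lut)
    return lut_device[x]

@vectorize("float32(int16)", target="cuda")
def test_lookup(x):
    return lookup(x)

test_lookup(cuda.to_device(tmp)).copy_to_host()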
I ran multiple variations of the code, including one that merely touched the lookup table from within the kernel without using its output. That, combined with @talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, led me to think that perhaps a shared memory constraint was causing the issue.
The above change makes the whole thing work. However, I still don't have a deep understanding of why.
If anyone knows and understands why, please feel free to contribute to this answer.
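For what it's worth, another pattern I believe sidesteps the closure entirely is to pass the table in as an explicit kernel argument using a plain @cuda.jit kernel instead of @vectorize. A sketch (the kernel name, grid sizing, and output buffer are my own, not from the original code):
from numba import cuda
import numpy as np

@cuda.jit
def lookup_kernel(indices, table, out):
    # each thread handles one element; the table arrives as an ordinary
    # global-memory argument rather than a captured global
    i = cuda.grid(1)
    if i < indices.shape[0]:
        out[i] = table[indices[i]]

tmp = np.random.uniform(0, 100, 1000000).astype(np.int16)
lut = np.arange(100).astype(np.float32) * 2.5
out = np.zeros(tmp.size, dtype=np.float32)

threads = 256
blocks = (tmp.size + threads - 1) // threads
lookup_kernel[blocks, threads](tmp, lut, out)  # numba transfers host arrays automatically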
Upvotes: 1