Reputation: 81
I'm quite new to CUDA and PyCUDA.
I need a kernel that creates a matrix (of dimension n x d) out of an array (1 x d) by simply "repeating" the same array n times.
For example, suppose we have n = 4 and d = 3; then if the array is [1 2 3]
the result of my kernel should be:
[1 2 3
1 2 3
1 2 3
1 2 3]
(a matrix 4x3).
Basically, it's the same as doing numpy.tile(array, (n, 1))
I've written the code below:
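import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

kernel_code_template = """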
__global__ void TileKernel(float *in, float *out)
{
    // Each thread computes one element of out
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y > %(n)s || x > %(d)s) return;
    out[y * %(d)s + x] = in[x];
}
"""
d = 64
n = 512
blockSizex = 16
blockSizey = 16
# ceiling division; // keeps the result an integer on Python 3 as well
gridSizex = (d + blockSizex - 1) // blockSizex
gridSizey = (n + blockSizey - 1) // blockSizey
# get the kernel code from the template
kernel_code = kernel_code_template % {
    'd': d,
    'n': n
}
mod = SourceModule(kernel_code)
TileKernel = mod.get_function("TileKernel")
vec_cpu = np.arange(d).astype(np.float32) # just as an example
vec_gpu = gpuarray.to_gpu(vec_cpu)
out_gpu = gpuarray.empty((n, d), np.float32)
TileKernel.prepare("PP")
TileKernel.prepared_call((gridSizex, gridSizey), (blockSizex, blockSizey, 1), vec_gpu.gpudata, out_gpu.gpudata)
out_cpu = out_gpu.get()
Now, if I run this code with d equal to a power of 2 >= 16, I get the right result (just like numpy.tile(vec_cpu, (n, 1))). But if I set d to anything else (for example 88), every element of the output matrix has the correct value except in the first column. There, some entries are right while others hold a different, apparently random value. The wrong value is the same for every wrong element within a run but changes from run to run, and the set of wrong entries in the first column also changes every run. Example:
[0 1 2
0 1 2
6 1 2
0 1 2
6 1 2
...]
I really can't figure out what is causing this problem, but maybe it's just something simple that I'm missing...
Any help will be appreciated, thanks in advance!
Upvotes: 1
Views: 316
Reputation: 72349
The bounds checking within your kernel code is incorrect. This:
if (y > n || x > d) return;
out[y * d + x] = in[x];
should be:
if (y >= n || x >= d) return;
out[y * d + x] = in[x];
or better still:
if ((y < n) && (x < d))
out[y * d + x] = in[x];
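The two guards can be compared without a GPU. The following is only a rough CPU sketch, not PyCUDA code: `emulate_tile` is a hypothetical helper, the out-of-bounds read of `in[d]` is modelled as a made-up `garbage` value, the output buffer is padded so stray writes do not raise, and the shuffled loop merely stands in for the fact that the GPU gives no ordering guarantee between threads.

```python
import random

def emulate_tile(n, d, block=16, fixed_guard=True, seed=0, garbage=-1.0):
    """Run every thread of the launch once, in a shuffled order, on the CPU."""
    vec = [float(i) for i in range(d)]
    grid_x = (d + block - 1) // block  # same ceiling division as the question
    grid_y = (n + block - 1) // block
    # Padding (one spare row plus one slot) so stray writes are stored, not raised.
    out = [None] * ((n + 1) * d + 1)
    threads = [(y, x) for y in range(grid_y * block) for x in range(grid_x * block)]
    random.Random(seed).shuffle(threads)  # assumption: stands in for GPU scheduling
    for y, x in threads:
        if fixed_guard:
            if y >= n or x >= d:
                continue
        elif y > n or x > d:  # the guard from the question
            continue
        # in[d] is past the end of the input; model that read as a garbage value.
        out[y * d + x] = vec[x] if x < d else garbage
    return [out[r * d:(r + 1) * d] for r in range(n)]

n, d = 64, 20  # d is deliberately not a multiple of the block width
reference = [[float(i) for i in range(d)] for _ in range(n)]
good = emulate_tile(n, d, fixed_guard=True)
bad = emulate_tile(n, d, fixed_guard=False)
# Cells where the faulty guard produced something other than the tiled vector.
wrong = [(r, c) for r in range(n) for c in range(d) if bad[r][c] != reference[r][c]]
```

With the corrected guard the emulated output matches the tiled reference; with the original guard every mismatch sits in column 0, and which rows are hit depends on the (here artificial) thread order.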
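All valid indices into the array lie in 0 <= x < d and 0 <= y < n. By allowing x = d you have undefined behaviour: the thread reads in[d], which is past the end of the input, and stores it in out[y * d + d], which is the first entry of the next row of the output array. That store races with the thread that writes the correct value to the same location, which explains why sometimes the results were correct and other times not. When d is a multiple of the block width (16 here), no thread ever has x = d, so the bug stays hidden.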
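To see which launches are affected, one can enumerate the threads that slip past the original `>` guard even though they are out of range. This is a small illustrative sketch in plain Python; `stray_threads` is a hypothetical helper, and the default block width of 16 is taken from the question.

```python
def stray_threads(n, d, block=16):
    """List (y, x, flat_index) for threads the faulty guard lets through."""
    grid_x = (d + block - 1) // block
    grid_y = (n + block - 1) // block
    stray = []
    for y in range(grid_y * block):
        for x in range(grid_x * block):
            passes_faulty_guard = not (y > n or x > d)
            in_range = y < n and x < d
            if passes_faulty_guard and not in_range:
                stray.append((y, x, y * d + x))
    return stray

print(len(stray_threads(512, 64)))  # 0: the grid covers exactly 64 columns
print(len(stray_threads(512, 88)))  # 512: one thread with x == 88 per row
```

For d = 88 every stray flat index is a multiple of 88, i.e. column 0 of the following row, and the last one (index 512 * 88) lies past the end of the output buffer altogether.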
Upvotes: 1