Reputation: 81
I'm following the example here to create a variable-length local memory array. The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
int ex_int,
__local void *local_var)
Then I call clSetKernelArg
for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem
is set from querying CL_DEVICE_LOCAL_MEM_SIZE
.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (gpu and cpu) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE
when I enqueue this kernel and then run clFinish
on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block )
{
const ushort thread_id = get_local_id(0);
__local double *foo = shared_mem_block;
__local ushort *bar = (__local ushort *) &(foo[1000]);
foo[thread_id] = 0.;
bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
Upvotes: 2
Views: 1082
Reputation: 81
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
Upvotes: 6