Gabor Szita

Reputation: 329

What to do if I have more work-items than SIZE_MAX in OpenCL

My OpenCL program involves having about 7 billion work-items. In my C++ program, I would set this to my global_item_size:

size_t global_item_size = 7200000000;

If my program is compiled for 64-bit systems (x64), this global size is fine, since SIZE_MAX (the maximum value of size_t) is much larger than 7 billion. However, to ensure backwards compatibility I want my program to also compile for 32-bit systems (x86). On 32-bit systems, SIZE_MAX is about 4 billion, less than my global size of 7 billion. If I tried to set the global size to 7 billion, it would overflow. What can I do in this case?

One solution I was considering is to use a multi-dimensional global size and local size. However, this requires the kernel to reconstruct the original linear global ID from the multi-dimensional IDs (my kernel heavily depends on the global and local size), which would cost some performance.
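For reference, the multi-dimensional split could be sketched on the host side as below. This is only an illustration, not code from the question: `split_range`, the 2^31 cap, and the local size of 64 are assumed names and values. Inside the kernel, the linear ID would then be recovered as `get_global_id(1) * get_global_size(0) + get_global_id(0)`.

```cpp
#include <cstdint>

// Illustrative helper: split a 64-bit item count into a 2D NDRange
// {dim0, dim1} with dim0 * dim1 >= total, where each dimension fits
// comfortably in 32 bits and dim0 is a multiple of the local size so
// that work-groups divide the first dimension evenly.
struct Range2D { uint64_t dim0, dim1; };

Range2D split_range(uint64_t total, uint64_t local0) {
    const uint64_t max_dim = 1ull << 31;          // stay well below 2^32
    uint64_t dim0 = max_dim - (max_dim % local0); // round down to a multiple of local0
    uint64_t dim1 = (total + dim0 - 1) / dim0;    // ceiling division
    return {dim0, dim1};
}
```

Because dim0 * dim1 can exceed the requested total, the kernel would also need a guard such as `if (id >= total) return;` for the padding threads.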

The other solution I considered was to launch multiple kernels. I think this would be a little "sloppy", and synchronizing the kernel launches would add overhead.

So my question basically is: How can I (if possible) make the global size larger than the maximum size of size_t? If this is not possible, what are some workarounds?

Upvotes: 0

Views: 239

Answers (2)

Rob

Reputation: 116

If you want to avoid batches, you can give each kernel more work by wrapping the kernel code in a for loop. E.g.

for (int i = 0; i < WORK_ITEMS_PER_THREAD; ++i)
{
    // reconstruct the linear index of the data item this iteration handles
    size_t id = WORK_ITEMS_PER_THREAD * get_global_id(0) + i;

    ...
}
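On the host side, the matching launch size would be the total item count divided by `WORK_ITEMS_PER_THREAD`, rounded up. A minimal sketch of that arithmetic (the helper name `threads_needed` is an assumption, not from the answer):

```cpp
#include <cstdint>

// Number of work-items to launch when each thread processes
// items_per_thread data items (ceiling division so no item is dropped).
uint64_t threads_needed(uint64_t total_items, uint64_t items_per_thread) {
    return (total_items + items_per_thread - 1) / items_per_thread;
}
```

With `WORK_ITEMS_PER_THREAD = 4`, the 7.2 billion items shrink to 1.8 billion work-items, which fits a 32-bit size_t. If the total is not evenly divisible, the kernel loop also needs a bounds check such as `if (id >= total_items) break;`.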

Upvotes: 1

ProjectPhysX

Reputation: 5746

Try to use uint64_t global_item_size = 7200000000ull; to avoid 32-bit integer overflow.

If you are strictly limited to the maximum 32-bit number of work items, you could do the computation in several batches (exchange GPU buffers in between compute steps via PCIe transfer) or you could pack several data items into one GPU thread.
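The batching idea could be sketched on the host like this. This is only an illustration of the splitting arithmetic under assumed names (`Batch`, `make_batches`); each batch would then be launched with `clEnqueueNDRangeKernel` using `count` as the global size, with `offset` passed to the kernel (e.g. as a kernel argument, since the built-in global work offset is itself a size_t on a 32-bit host):

```cpp
#include <cstdint>
#include <vector>

// Illustrative helper: split a 64-bit total into batches of at most
// max_batch items each, e.g. max_batch chosen below SIZE_MAX on a
// 32-bit host. Each batch records its starting offset and item count.
struct Batch { uint64_t offset, count; };

std::vector<Batch> make_batches(uint64_t total, uint64_t max_batch) {
    std::vector<Batch> batches;
    for (uint64_t off = 0; off < total; off += max_batch) {
        uint64_t remaining = total - off;
        uint64_t count = remaining < max_batch ? remaining : max_batch;
        batches.push_back({off, count});
    }
    return batches;
}
```

Inside the kernel, each thread would then work on item `offset + get_global_id(0)`.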

Upvotes: 1
