Igoren

Reputation: 111

Is there a limit on the size of array that can be used in CUDA?

I've written a program that calculates the integral of a simple function. When testing it, I found that if I used an array larger than 10 million elements, it produced the wrong answer. The error seemed to occur once the array had been manipulated in a CUDA kernel. With 10 million elements and below, it worked fine and produced the correct result.

Is there a limit on the number of elements that can be transferred to the GPU or computed on the GPU?

P.S. I'm using C-style arrays containing floats.

Upvotes: 5

Views: 5541

Answers (2)

bandybabboon

Reputation: 2346

For compute capability >= 3.0, the maximum grid dimensions are 2147483647 x 65535 x 65535, so with up to 1024 threads per block that covers any 1-D array of up to 2147483647 x 1024 = 2.1990233e+12 elements.

1 billion element arrays are definitely fine.

1,000,000,000 / 1024 = 976562.5, which rounds up to 976563 blocks. Just make sure that if threadIdx.x + blockIdx.x * blockDim.x >= the number of elements, you return from the kernel without processing.

Upvotes: 0

void-pointer

Reputation: 14827

There are many different kinds of memory that you can use with CUDA. In particular, you have

  • Linear Memory (cuMemAlloc)
  • Pinned Memory (cuMemAllocHost)
  • Zero-Copy (Mapped) Memory (cuMemHostAlloc with CU_MEMHOSTALLOC_DEVICEMAP)
  • Pitch Allocation (cuMemAllocPitch)
  • Textures Bound to Linear Memory
  • Textures Bound to CUDA Arrays
  • Textures Bound to Pitch Memory
  • ...and cube maps and surfaces, which I will not list here.

Each kind of memory is associated with its own hardware resource limits, many of which you can query with cuDeviceGetAttribute. The function cuMemGetInfo returns the amount of free and total memory on the device, but because of alignment requirements, allocating 1,000,000 floats may consume more than 1,000,000 * sizeof(float) bytes. The maximum number of blocks that you can schedule at once is also a limitation: if you exceed it, the kernel will fail to launch (you can find this number with cuDeviceGetAttribute as well).

You can find the alignment requirements for different kinds of memory through the CUDA Driver API, but for a simple program, you can make a reasonable guess and check the return value of the allocation function to determine whether the allocation succeeded.

There is no restriction on the number of bytes that you can transfer; using the asynchronous functions, you can overlap kernel execution with memory copies (provided that your card supports this). If you exceed the maximum number of blocks you can schedule, or consume all the available memory on your device, you will have to split up your task and use multiple kernels to handle it.

Upvotes: 4
