Lostsoul

Reputation: 25999

Writing a large array of unknown size in CUDA?

I have a process that sends data to CUDA for processing, and it outputs data that matches certain criteria. The problem is that I often don't know the size of the output array. What can I do?

I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save them. The problem is that I cannot create a linked list in CUDA (let me know if I can), and the memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown output size.

My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is that it would be huge and mostly unused (800 lines of data * 20K possible outcomes = 16 million items in an array, most of which would never be written).

Is there a better way to deal with variable-size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn it).
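
For reference, here is a minimal sketch of the zero-copy approach I mean, in case it helps frame the question. The kernel and buffer names are just placeholders, and the per-thread result is a stand-in for my real computation:

#include <cuda_runtime.h>

// Hypothetical kernel: writes results straight into pinned host memory.
__global__ void writeResults(int *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = i;  // placeholder for the real result computation
}

int main()
{
  const int n = 1024;
  int *h_out, *d_out;

  cudaSetDeviceFlags(cudaDeviceMapHost);               // enable mapped host memory
  cudaHostAlloc((void**)&h_out, n * sizeof(int),
                cudaHostAllocMapped);                  // pinned + mapped allocation
  cudaHostGetDevicePointer((void**)&d_out, h_out, 0);  // device-side alias of h_out

  writeResults<<<(n + 255) / 256, 256>>>(d_out, n);
  cudaDeviceSynchronize();                             // h_out now holds the results

  cudaFreeHost(h_out);
  return 0;
}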

Upvotes: 0

Views: 2245

Answers (3)

raducu2205

Reputation: 29

Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

In your kernel function you can do something like this, using dynamically allocated shared memory:

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];  // size is set at launch time
  int t = threadIdx.x;
  int tr = n - t - 1;
  s[t] = d[t];                // stage the array in shared memory
  __syncthreads();
  d[t] = s[tr];               // write it back reversed
}

and when you call the kernel on the host, pass the shared memory size in bytes as the third launch parameter, precisely n*sizeof(int):

dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

Also, it's best practice to split a huge kernel function into several smaller kernels where possible; they contain less code and are easier to execute efficiently.

Upvotes: 0

J J

Reputation: 146

Yes, the bottleneck in CUDA and in all GPGPU work is the transfer from host to device and back.

But inside kernels, always work with known sizes. A kernel must not call malloc; that runs against the whole concept of the platform. Even if you have a for loop in a CUDA kernel, think twenty times about whether your approach is optimal; you would have to be implementing a really complex algorithm for that. Is it really necessary on a parallel platform? You would not believe what problems can come up if you ignore this.

Use a buffered approach. Choose a buffer size that depends more on CUDA (read: hardware) requirements than on your array, then call the kernel in a loop: upload, process, and retrieve one buffer per iteration. At some point your data will run out and the last buffer will not be full. You can pass the size of each buffer as a single value (a pointer to an int, for example) that each thread compares against its thread id to decide whether it can read a value or would be out of bounds. Only the last block will have divergence.
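
For example, a minimal sketch of this buffered loop. The names (processChunk, BUF_SIZE) are placeholders, the chunk size is passed by value rather than through a pointer for simplicity, and the per-element doubling stands in for the real processing:

#include <cuda_runtime.h>

#define BUF_SIZE 4096  // hypothetical chunk size, tuned to the hardware

__global__ void processChunk(const int *in, int *out, int count)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < count)           // only the tail of the last chunk diverges here
    out[i] = in[i] * 2;    // placeholder for the real processing
}

void processAll(const int *h_in, int *h_out, int total)
{
  int *d_in, *d_out;
  cudaMalloc((void**)&d_in,  BUF_SIZE * sizeof(int));
  cudaMalloc((void**)&d_out, BUF_SIZE * sizeof(int));

  for (int off = 0; off < total; off += BUF_SIZE) {
    // The last buffer may be only partially full.
    int count = (total - off < BUF_SIZE) ? (total - off) : BUF_SIZE;
    cudaMemcpy(d_in, h_in + off, count * sizeof(int),
               cudaMemcpyHostToDevice);
    processChunk<<<(count + 255) / 256, 256>>>(d_in, d_out, count);
    cudaMemcpy(h_out + off, d_out, count * sizeof(int),
               cudaMemcpyDeviceToHost);
  }
  cudaFree(d_in);
  cudaFree(d_out);
}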

Upvotes: 0

geek

Reputation: 1839

Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).

Generally, CUDA device memory allocation is the main bottleneck for program performance. The common practice is to allocate all needed memory at the beginning and reuse it for as long as possible.

I think you can create a buffer that is big enough and use it instead of per-result memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer. In the simplest case you can keep the index of the last free cell in your array, so you know where to write the next piece of data.
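
For example, a minimal sketch of that simplest case, with placeholder names and a trivial matching rule: each thread that finds a match reserves the next free cell with atomicAdd and writes its result there.

__device__ int d_count = 0;  // index of the next free cell

__global__ void collectMatches(const int *in, int n, int *out, int capacity)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  if (in[i] % 3 == 0) {                   // placeholder matching rule
    int slot = atomicAdd(&d_count, 1);    // reserve the next free cell
    if (slot < capacity)                  // guard against buffer overflow
      out[slot] = in[i];
  }
}

After the kernel finishes, you can copy d_count back with cudaMemcpyFromSymbol to learn how many results were actually written, and reset it with cudaMemcpyToSymbol before the next run.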

Upvotes: 1
