liza

Reputation: 31

2D array on CUDA

I want to dynamically allocate a global 2D array in CUDA. How can I achieve this?

In my main I am calling my kernel in a loop. Before each kernel call I need to allocate some memory on the GPU. After the kernel call, a single integer is sent from the GPU to the CPU to report whether the problem is solved or not.
If the problem is not solved, I will not free the old memory, since it is still needed; instead I should allocate new memory on the GPU and call the kernel again.

Pseudocode is shown below:

int n=0,i=0;
while(n==0)
{
    //allocate 2d memory for MEM[i++] 
    //call kernel(MEM,i)
    // get n from kernel       
}


__global__ void kernel(int **Mem, int i)
{
    Mem[0][5] = 1;
    Mem[1][0] = Mem[0][5] + 23; // can use this when Mem[1] is allocated before the kernel call
}
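
For the "get n from kernel" step, the flag round-trip could look like this (a minimal sketch; the device pointer d_n and its name are assumptions, not part of my actual code):

int *d_n = NULL;  // device-side copy of the "solved" flag (hypothetical name)
cudaMalloc((void **)&d_n, sizeof(int));
cudaMemset(d_n, 0, sizeof(int));

// inside the loop, after each kernel launch:
cudaMemcpy(&n, d_n, sizeof(int), cudaMemcpyDeviceToHost);

The kernel would then take d_n as an extra parameter and write a nonzero value to it once the problem is solved.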

Any suggestions? Thank you.

Upvotes: 1

Views: 6450

Answers (3)

x13n

Reputation: 4163

Well, you can do it just as it would be done on the CPU.

unsigned xSize = 666, ySize = 666;
int **h_ptr = (int**)malloc(sizeof(int*) * xSize);
int **d_ptr = NULL;
cudaMalloc( &d_ptr, sizeof(int*) * xSize );
for(unsigned i = 0; i < xSize; ++i)
{
    cudaMalloc( &h_ptr[i], sizeof(int) * ySize ); // each row lives in device memory
}
// copy the array of device row pointers to the device
cudaMemcpy( d_ptr, h_ptr, sizeof(int*) * xSize, cudaMemcpyHostToDevice );
free( h_ptr );

...and free it similarly:

int **h_ptr = (int**)malloc(sizeof(int*) * xSize);
// fetch the row pointers back so the rows can be freed
cudaMemcpy( h_ptr, d_ptr, sizeof(int*) * xSize, cudaMemcpyDeviceToHost );
for(unsigned i = 0; i < xSize; ++i )
{
    cudaFree( h_ptr[i] );
}
cudaFree( d_ptr );
free( h_ptr );

But you should keep in mind that every access to a cell of this array involves accessing GPU global memory twice: once to read the row pointer and once to read the element itself. Because of that, memory access will be roughly two times slower than with a 1D array.
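
For illustration, element access in a kernel would then look like this (a minimal sketch; the kernel name is hypothetical):

__global__ void touch(int **d_ptr)
{
    // reading d_ptr[0][5] costs two global memory transactions:
    // one to fetch the row pointer d_ptr[0], one to fetch the element
    d_ptr[1][0] = d_ptr[0][5] + 23;
}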

Upvotes: 2

talonmies

Reputation: 72342

Two opening comments - using a dynamically allocated 2D array is a bad idea in CUDA, and doing repetitive memory allocations in a loop is also not a good idea. Both incur needless performance penalties.

For the host code, something like this:

size_t allocsize = 16000; // number of floats per kernel call (elements, not
                          // bytes, so the pointer arithmetic below works out)
int n_allocations = 16;
float * dpointer;
cudaMalloc((void **)&dpointer, n_allocations * allocsize * sizeof(float));

float * dcurrent = dpointer;
int n = 0;
for(int i=0; ((n==0) && (i<n_allocations)); i++, dcurrent+=allocsize) {

    // whatever you do before the kernel

    kernel <<< gridsize,blocksize >>> (dcurrent,.....);

    // whatever you do after the kernel

}

is preferable. Here you only call cudaMalloc once, and pass offsets into the allocation, which makes memory allocation and management free inside the loop. The loop structure also means you can't run endlessly and exhaust all the GPU memory.

On the 2D array question itself, there are two reasons why it is a bad idea. Firstly, allocating a 2D array with N rows requires (N+1) cudaMalloc calls plus a host-to-device memory copy, which is slow and ugly. Secondly, inside the kernel code, to get at your data the GPU must do two global memory reads, one for the pointer indirection to get the row address, and then one to fetch the data from the row. That is much slower than this alternative:

#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
__global__ void kernel(float * Mem, int lda, ....)
{
    Mem[idx(0,5,lda)]=1; // equivalent to Mem[0][5]=1;
}

which uses indexing into a 1D allocation. On the GPU, memory transactions are very expensive, but FLOPS and IOPS are cheap. A single integer multiply-add is the most efficient way to do this. If you need to access results from a previous kernel call, just pass the offset to the previous results and use two pointers inside the kernel, something like this:

__global__ void kernel(float *Mem, int lda, int current, int previous)
{
   float * Mem0 = Mem + current;  // this call's slice of the allocation
   float * Mem1 = Mem + previous; // the previous call's results

}

Efficient distributed memory programs (and CUDA is really a type of distributed memory programming) start to look like Fortran after a while, but that is the price you pay for portability, transparency and efficiency.

Hope this helped.

Upvotes: 6

Marco

Reputation: 57593

EDITED:
I was trying to help you by providing an example in which, by flattening the array, you can achieve the same result, but other users pointed out that this is not what you're asking for.
So there is another post here telling you how you can allocate 2d arrays in CUDA.

Upvotes: 0
