Bakus123

Reputation: 1399

Cuda - Big array initialization

What is the most efficient way to initialize a large array of integers on the GPU? I need to assign 1 to the first two elements and 0 to all the others (for the Sieve of Eratosthenes).

  1. cudaMemcpy
  2. cudaMemset + set value of 2 first elements in kernel
  3. initialization direct in kernel
  4. something else

Note: Array size is dynamic (n is passed as an argument).

My current version:

int *array = (int *) malloc(array_size);
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
    array[i] = 0;
}
HANDLE_ERROR(cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice));
kernel<<<10, 10>>>(dev_array);

I would be grateful for an example.

Upvotes: 3

Views: 4444

Answers (1)

Grzegorz Szpetkowski

Reputation: 37924

One possibility is to directly initialize a __device__ array on the GPU, provided it has a constant size, by adding the following declaration at file scope (that is, outside of any function):

__device__ int dev_array[SIZE] = {1, 1};

The remaining elements will be initialized to zero (you can check the PTX assembly to confirm this).

It can then be used in a kernel like:

__global__ void kernel(void)
{
    int tid = ...;
    int elem = dev_array[tid];
    ...
}

In case of variable size, you can combine cudaMalloc() with cudaMemset():

int array_size = ...;
int *dev_array;

cudaMalloc((void **) &dev_array, array_size * sizeof(int));
cudaMemset(dev_array, 0, array_size * sizeof(int));

then set the first two elements to 1:

int helper_array[2] = {1, 1};
cudaMemcpy(dev_array, helper_array, 2 * sizeof(int), cudaMemcpyHostToDevice);
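
Putting these pieces together, a minimal host-side sketch of the variable-size approach might look like this (a sketch only; the helper function name is made up, and error checking via the question's HANDLE_ERROR macro is omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocate an n-element device array, zero it with
// cudaMemset(), then copy {1, 1} into the first two elements.
// Assumes n >= 2; wrap the CUDA calls in HANDLE_ERROR as needed.
int *make_sieve_array(int n)
{
    int *dev_array = NULL;
    cudaMalloc((void **) &dev_array, n * sizeof(int));
    cudaMemset(dev_array, 0, n * sizeof(int));

    int helper_array[2] = {1, 1};
    cudaMemcpy(dev_array, helper_array, 2 * sizeof(int),
               cudaMemcpyHostToDevice);
    return dev_array;
}
```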

Beginning with compute capability 2.0, you can also allocate the whole array directly within the kernel via the malloc() device function:

__global__ void kernel(int array_size)
{
    __shared__ int *dev_array;  /* shared, so every thread in the block
                                   sees the pointer set by thread 0 */
    int tid = ...;

    if (tid == 0) {
        dev_array = (int *) malloc(array_size * sizeof(int));
        if (dev_array == NULL) {
            ...
        }
        memset(dev_array, 0, array_size * sizeof(int));
        dev_array[0] = dev_array[1] = 1;
    }
    __syncthreads();

    ...
}

Note that __syncthreads() is a per-block barrier: threads from other blocks are not synchronized by it and cannot safely assume the array has been initialized.

From the CUDA C Programming Guide:

The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.

Unfortunately, the calloc() function is not implemented, so you need to memset() the memory anyway. Allocated memory has the lifetime of the CUDA context, but you can explicitly call free() from this or a subsequent kernel at any time:

The memory allocated by a given CUDA thread via malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches.
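
For instance, the pointer obtained from in-kernel malloc() could be stashed in a __device__ variable and released by a later kernel launch. This is a sketch under that assumption (the variable and kernel names are made up, not from the answer above):

```cuda
__device__ int *g_array;  // hypothetical global holding the heap pointer

__global__ void alloc_kernel(int n)
{
    // One thread allocates; the pointer outlives this launch.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        g_array = (int *) malloc(n * sizeof(int));
    }
}

__global__ void free_kernel(void)
{
    // A subsequent launch may release memory allocated earlier.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        free(g_array);
    }
}
```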

With all that said, I wouldn't worry much about the supplementary cudaMemcpy(), since it copies only two elements and would likely take less than 0.01% of the total execution time (it's easy to profile). Choose whichever way makes your code clearest; anything else is premature optimization.

Upvotes: 6
