Anoracx

Reputation: 438

Read an Array With Threads in CUDA

I was wondering whether it is possible, and what the best way would be, to read cells of an array with threads in CUDA. To simplify what I mean, here is an example:

I have an array {1,2,3,4,5,6,...} and I would like each thread to read n cells of my array, with n depending mainly on the array's size.

I have tried a few things, but none of them seem to work, so if anyone could point out a (right) way to do it, that would be great.

Thank you.

Upvotes: 1

Views: 2649

Answers (2)

lashgar

Reputation: 5470

You said each thread should look at the next n numbers of the array. So you can partition the array so that thread tid owns the n consecutive cells starting at tid*n:

#define N 2
#define NTHREAD 1024
#define ARRAYSIZE (N * NTHREAD)

// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;

    // access thread's stride
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}
// call the kernel by:
accessArray<<<NTHREAD/256, 256>>>(d_array);

Dump out the array and check whether the threads accessed it the way you intended.

Full code:

#include <cuda.h>
#include <stdio.h>


#define N 2
#define NTHREAD 1024
#define ARRAYSIZE (N * NTHREAD)

// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;

    // access thread's stride
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}

int  main()
{
    int h_array[ARRAYSIZE];
    int *d_array;
    size_t memsize = ARRAYSIZE * sizeof(int);

    for (int i=0; i< ARRAYSIZE; i++) {
        h_array[i] = 0;
    }

    cudaMalloc(&d_array, memsize);
    cudaMemcpy(d_array, h_array, memsize,  cudaMemcpyHostToDevice);

    accessArray<<<NTHREAD/256, 256>>>(d_array);
    cudaMemcpy(h_array, d_array, memsize,  cudaMemcpyDeviceToHost);

    for (int i=0; i<ARRAYSIZE; i++)
        printf("A[%d] => %d\n",i,h_array[i]);

    cudaFree(d_array);
}

Upvotes: 0

harrism

Reputation: 27899

Generally you want contiguous threads to read contiguous array indices. Doing so results in "coalesced" memory transactions. The simple way to think of it is that if 32 threads are running physically in parallel, and they all do a load, then if all 32 loads fall into the same cache line, then a single memory access can be performed to fill the cache line, rather than 32 separate ones.

So what you want to do is have each thread access n cells that are strided by the total number of threads, like this (assuming the input data is in the float array data).

int idx = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
int stride = blockDim.x * gridDim.x;              // total number of threads
// grid-stride loop: on each iteration, consecutive threads
// read consecutive elements, so the loads coalesce
for (int i = idx; i < numElements; i += stride) {
  float element = data[i];
  process(element);
}

If your algorithm requires that each thread reads n contiguous data elements, then you are going to incur non-coalesced loads, which will be much more expensive. In this case, I would consider re-designing the algorithm so this type of access is not required.

Upvotes: 3
