Reputation: 438
I was wondering whether it is possible, and what the best way is, to read cells from an array with threads in CUDA. To simplify what I mean, here is an example:
I have an array {1,2,3,4,5,6,...} and I would like each thread to read n cells of my array, with n depending mainly on its size.
I have tried a few things, but they do not seem to work, so if anyone could point out a (right) way to do it, that would be great.
Thank you.
Upvotes: 1
Views: 2649
Reputation: 5470
Each thread has to look at the next n numbers, so you can use:
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE (N*NTHREAD)  // parenthesized so the macro expands safely inside larger expressions
// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its id into its own N contiguous cells
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}
// call the kernel by:
accessArray<<<NTHREAD/256, 256>>>(d_array);
Then dump out the array and check whether the threads access it the way you want.
Full code:
#include <cuda_runtime.h>
#include <stdio.h>
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE (N*NTHREAD)
// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its id into its own N contiguous cells
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}
int main()
{
    int h_array[ARRAYSIZE];
    int *d_array;
    // the array holds int, so size it with sizeof(int)
    size_t memsize = ARRAYSIZE * sizeof(int);
    for (int i=0; i<ARRAYSIZE; i++) {
        h_array[i] = 0;
    }
    cudaMalloc(&d_array, memsize);
    cudaMemcpy(d_array, h_array, memsize, cudaMemcpyHostToDevice);
    // NTHREAD/256 blocks of 256 threads gives NTHREAD threads in total
    accessArray<<<NTHREAD/256, 256>>>(d_array);
    cudaMemcpy(h_array, d_array, memsize, cudaMemcpyDeviceToHost);
    for (int i=0; i<ARRAYSIZE; i++)
        printf("A[%d] => %d\n",i,h_array[i]);
    cudaFree(d_array);
    return 0;
}
Upvotes: 0
Reputation: 27899
Generally you want contiguous threads to read contiguous array indices. Doing so results in "coalesced" memory transactions. The simple way to think of it is this: if 32 threads are running physically in parallel and they all do a load, and all 32 loads fall into the same cache line, then a single memory access can fill the cache line instead of 32 separate ones.
So what you want to do is have each thread access n cells that are strided by the total number of threads, like this (assuming the input data is in the float array data).
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < numElements; i += stride) {
    float element = data[i];
    process(element);
}
If your algorithm requires that each thread reads n contiguous data elements, then you are going to incur non-coalesced loads, which will be much more expensive. In this case, I would consider re-designing the algorithm so this type of access is not required.
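To make the strided pattern above concrete, here is a minimal, self-contained sketch of it as a full program. The kernel name markOwner, the array size, and the launch configuration are assumptions chosen for illustration and are not part of the answer; the kernel simply records which thread handled each cell so the mapping can be inspected on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the answer): records which global thread
// index handled each element while walking the array in a grid-stride loop.
__global__ void markOwner(int *owner, int numElements)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads in the grid
    for (int i = idx; i < numElements; i += stride) {
        owner[i] = idx;   // neighbouring threads touch neighbouring cells
    }
}

int main()
{
    const int numElements = 2048;      // assumed size, chosen for illustration
    const int threadsPerBlock = 256;
    const int numBlocks = 4;           // 1024 threads, so 2 cells per thread
    const size_t bytes = numElements * sizeof(int);

    static int h_owner[numElements];
    int *d_owner = NULL;

    if (cudaMalloc(&d_owner, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    markOwner<<<numBlocks, threadsPerBlock>>>(d_owner, numElements);

    // cudaMemcpy on the default stream waits for the kernel to finish first
    cudaError_t err = cudaMemcpy(h_owner, d_owner, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_owner);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Cells 0 and 1024 belong to thread 0; cells 1 and 1025 belong to thread 1.
    printf("owner[0]=%d owner[1]=%d owner[1024]=%d owner[1025]=%d\n",
           h_owner[0], h_owner[1], h_owner[1024], h_owner[1025]);
    return 0;
}
```

With 1024 threads and 2048 cells, thread t ends up with cells t and t + 1024, so at every loop iteration the 32 threads of a warp touch 32 adjacent ints. Because the loop bound is numElements rather than a fixed count per thread, the same kernel should also cope with array sizes that are not a multiple of the thread count.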
Upvotes: 3