Reputation: 325
I am using CUDA to set selected elements of an array to zero by index. The array has about 7,000,000 elements and the index list has about 1,000 entries, so I want the kernel to be efficient. The only technique I know is choosing the block size with cudaOccupancyMaxPotentialBlockSize. Could anyone suggest how to speed this up?
For example, the array is double *a with size n, and the index list is int *index with length n1.
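__global__ void setZero(int n, double *a, int n1, const int *index)
{
    // One thread per array element; each thread scans the whole index list.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
    {
        for (int ii = 0; ii < n1; ii++)
            if (i == index[ii] - 1)   // index values are 1-based
                a[i] = 0;
    }
}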
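int main()
{
    // n, n1, d_a and d_index are assumed to be set up earlier
    // (device allocation and host-to-device copies are not shown).
    int blockSize;
    int minGridSize;
    int gridSize;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, setZero, 0, n);
    gridSize = (n + blockSize - 1) / blockSize;
    setZero<<<gridSize, blockSize>>>(n, d_a, n1, d_index);
    return 0;
}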
As a mini sample, a = {1,2,3,4,5}, index = [2,4]. The output is a = {1,0,3,0,5}.
Upvotes: 1
Views: 276
Reputation: 51583
Given your constraints, I think the following would already be good enough, using one thread per index entry instead of one per array element:
__global__ void setZero(int n, double *a, int n1, const int *index)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n1)                 // n1 is the length of index
        a[index[id] - 1] = 0;    // index values are 1-based in the question
}
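For completeness, here is a minimal end-to-end sketch of that idea, with the grid sized from n1 rather than n. It assumes the indices are 1-based, as in the question's sample. The names setZeroByIndex and zeroAtIndices are hypothetical, the block size of 256 is just a common default, and the range check is an extra safety measure that the question did not ask for.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per index entry, so only n1 threads are launched instead of n.
// Assumption: index values are 1-based, as in the question's sample.
__global__ void setZeroByIndex(int n, double *a, int n1, const int *index)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n1) {
        int target = index[id] - 1;        // convert to 0-based
        if (target >= 0 && target < n)     // skip out-of-range entries
            a[target] = 0.0;
    }
}

// Hypothetical host helper: copies the data to the device, runs the kernel,
// and copies the array back. Returns cudaSuccess when everything worked.
cudaError_t zeroAtIndices(std::vector<double> &a, const std::vector<int> &index)
{
    int n = (int)a.size();
    int n1 = (int)index.size();
    if (n == 0 || n1 == 0) return cudaSuccess;

    double *d_a = nullptr;
    int *d_index = nullptr;
    cudaError_t err = cudaMalloc(&d_a, n * sizeof(double));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_index, n1 * sizeof(int));
    if (err != cudaSuccess) { cudaFree(d_a); return err; }

    cudaMemcpy(d_a, a.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_index, index.data(), n1 * sizeof(int), cudaMemcpyHostToDevice);

    int blockSize = 256;                               // assumed default
    int gridSize = (n1 + blockSize - 1) / blockSize;   // sized by n1, not n
    setZeroByIndex<<<gridSize, blockSize>>>(n, d_a, n1, d_index);
    err = cudaGetLastError();
    if (err == cudaSuccess)   // this copy also waits for the kernel to finish
        err = cudaMemcpy(a.data(), d_a, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_index);
    return err;
}

int main()
{
    std::vector<double> a = {1, 2, 3, 4, 5};
    std::vector<int> index = {2, 4};
    cudaError_t err = zeroAtIndices(a, index);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (double v : a) printf("%g ", v);
    printf("\n");
    return 0;
}
```

With the question's sample this should reproduce a = {1,0,3,0,5}. Duplicate entries in index are harmless here because every thread writes the same value. In a real application the array would normally stay on the device between kernels, so the copies in zeroAtIndices are only there to keep the sketch self-contained.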
Upvotes: 2