Reputation: 1
Which part should I change to speed up this code? And what exactly is the code doing?
__global_ void mat(Matrix a, Matrix b)
{
int[] tempData = new int[2];
tempData[0] = threadIdx.x ;
tempData[1] = blockIdx.x * blockDim;
b.elements[tempData[1] + tempData[0]] = b.elements[tempData[1] + tempData[0]] * 5;
}
Upvotes: 0
Views: 110
Reputation: 151799
If that is all the code in question, this is silliness:
int[] tempData = new int[2];
tempData[0] = threadIdx.x ;
tempData[1] = blockIdx.x * blockDim;
Just do this instead:
__global__ void mat(Matrix a, Matrix b)
{
    int tempData_0 = threadIdx.x;
    int tempData_1 = blockIdx.x * blockDim.x;
    b.elements[tempData_1 + tempData_0] = b.elements[tempData_1 + tempData_0] * 5;
}
The construct tempData[1] + tempData[0]
is effectively creating the canonical CUDA 1D globally unique thread index:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
With that constructed index, then your main code is:
b.elements[idx] = b.elements[idx] * 5;
Which is taking the elements of a vector (or a matrix, where the rows or columns are stored contiguously) and multiplying them by 5.
Your code could probably be simplified further for readability, following the outline above using idx, but those changes won't make a significant performance difference. The compiler can figure out those kinds of transformations on its own.
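For reference, a minimal sketch of the simplified kernel described above. The Matrix struct fields (elements and a count n) are assumptions, since the question never shows the struct definition; the bounds check guards against launching more threads than elements:

```cuda
// Sketch only: assumes Matrix holds a device pointer `elements`
// and an element count `n` (neither is shown in the question).
struct Matrix {
    int *elements;
    int n;
};

__global__ void mat(Matrix a, Matrix b)
{
    // Canonical 1D globally unique thread index
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < b.n)          // skip extra threads in the last block
        b.elements[idx] *= 5;
}
```

Launched with something like `mat<<<(b.n + 255) / 256, 256>>>(a, b);` so the grid covers all n elements.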
Upvotes: 1