Reputation: 2014
I'm having a problem with some nested for loops that I have to convert from C/C++ to CUDA. Basically, I have four nested for loops that share the same array and perform bit-shift operations.
#define N 65536
// ----------------------------------------------------------------------------------
int a1,a2,a3,a4, i1,i2,i3,i4;
int Bit4CBitmapLookUp[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};
int _cBitmapLookupTable[N];
int s = 0; // index into the cBitmapLookupTable
for (i1 = 0; i1 < 16; i1++)
{
    // first customer
    a1 = Bit4CBitmapLookUp[i1] << 12;
    for (i2 = 0; i2 < 16; i2++)
    {
        // second customer
        a2 = Bit4CBitmapLookUp[i2] << 8;
        for (i3 = 0; i3 < 16; i3++)
        {
            // third customer
            a3 = Bit4CBitmapLookUp[i3] << 4;
            for (i4 = 0; i4 < 16; i4++)
            {
                // fourth customer
                a4 = Bit4CBitmapLookUp[i4];
                // now actually set the sBitmapLookupTable value
                _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
                s++;
            } // for i4
        } // for i3
    } // for i2
} // for i1
This is the code that I need to convert to CUDA. I tried different approaches, but every time I get the wrong output. Here is my CUDA version (the relevant part of the kernel):
#define N 16
//----------------------------------------------------------------------------------
// index for the GPU
int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3 = i1;
int i4 = i2;
__syncthreads();
for(i1 = i2 = 0; i1 < N, i2 < N; i1++, i2++)
{
    // first customer
    a1 = Bit4CBitmapLookUp_device[i1] << 12;
    // second customer
    a2 = Bit4CBitmapLookUp_device[i2] << 8;
    for(i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++){
        // third customer
        a3 = Bit4CBitmapLookUp_device[i3] << 4;
        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];
        // now actually set the sBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
    }
}
I'm brand new to CUDA and still learning, but I really can't find a solution for these nested for loops. Thank you in advance.
Upvotes: 3
Views: 3355
Reputation: 1897
As leftaroundabout already indicated, there's a problem with the initialization. I would recommend rewriting the program as follows:
// starting indices of this thread
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
int i1, i2, i3, i4;
// grid-stride loops: together the threads visit every (i1, i2) pair once
for(i1 = x; i1 < N; i1 += blockDim.x*gridDim.x){
    a1 = ..;
    for(i2 = y; i2 < N; i2 += blockDim.y*gridDim.y){
        a2 = ..;
        for(i3 = 0; i3 < N; i3++){
            // third customer
            a3 = Bit4CBitmapLookUp_device[i3] << 4;
            for(i4 = 0; i4 < N; i4++){
                // fourth customer
                a4 = Bit4CBitmapLookUp_device[i4];
                // flat index, in the same order as the serial loops
                int s = ((i1*N + i2)*N + i3)*N + i4;
                // now actually set the sBitmapLookupTable value
                _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
            }
        }
    }
}
I haven't tested it, so I can't guarantee that the indices are correct. I'll leave that to you.
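Since the snippet above is untested, one way to check the index arithmetic is to emulate the launch on the host. What follows is only a minimal sketch under some assumptions: the names `kDigits`, `kLookup`, `fill_pair` and `fill_by_pairs` are made up for illustration, a1 and a2 are assumed to be computed exactly as in the question's serial code, and the output position is computed from the four loop counters instead of using a running counter, so that it does not depend on the order in which threads run.

```cpp
#include <vector>

constexpr int kDigits = 16;  // range of each loop counter (N in the kernel)

// Same 16-entry lookup table as Bit4CBitmapLookUp in the question.
static const int kLookup[kDigits] = {0, 1, 3, 3, 7, 7, 7, 7,
                                     15, 15, 15, 15, 15, 15, 15, 15};

// Work for one (i1, i2) pair, i.e. what a single thread would do.
// The two inner loops run serially and write the 256 consecutive
// entries that the serial code produces for this pair.
void fill_pair(int i1, int i2, std::vector<int> &table)
{
    const int a1 = kLookup[i1] << 12;  // first customer
    const int a2 = kLookup[i2] << 8;   // second customer
    for (int i3 = 0; i3 < kDigits; i3++) {
        const int a3 = kLookup[i3] << 4;  // third customer
        for (int i4 = 0; i4 < kDigits; i4++) {
            const int a4 = kLookup[i4];   // fourth customer
            // Flat index equal to the running counter s of the serial loops.
            const int s = ((i1 * kDigits + i2) * kDigits + i3) * kDigits + i4;
            table[s] = a1 | a2 | a3 | a4;
        }
    }
}

// Emulates the launch on the host. The pairs are visited in reverse order
// on purpose: the result must not depend on the order of the "threads".
std::vector<int> fill_by_pairs()
{
    std::vector<int> table(kDigits * kDigits * kDigits * kDigits, -1);
    for (int i1 = kDigits - 1; i1 >= 0; i1--)
        for (int i2 = kDigits - 1; i2 >= 0; i2--)
            fill_pair(i1, i2, table);
    return table;
}
```

In a real kernel, i1 and i2 would come from the thread and block indices rather than from the two host loops in `fill_by_pairs`; comparing the resulting table against the serial loops from the question is then a quick way to validate the indices.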
A bit more explanation: in the code above, only the loops over i1 and i2 are parallelized. This assumes that N^2 is large enough compared to the number of cores on your GPU. If that is not the case, all four loops need to be parallelized to obtain an efficient program, and the approach would then be a bit different.
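To illustrate what parallelizing all four loops could look like, here is a hedged sketch rather than a tuned implementation. It relies on the observation that each loop counter runs over 16 values, so the four counters are simply the four 4-bit nibbles of the flat output index s. The names `table_entry`, `fill_table`, `fill_table_host`, `kTableSize` and the `HOST_DEVICE` macro are hypothetical, and the lookup table is copied into the function instead of being read from `Bit4CBitmapLookUp_device`. The CUDA-specific parts are guarded by `__CUDACC__`, so the same file also builds with a plain C++ compiler for checking on the host.

```cpp
#include <vector>

// CUDA qualifiers under nvcc, nothing in a plain C++ build.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr unsigned kTableSize = 65536;  // 16^4 entries, as in the question

// Table value at flat index s. The nibbles of s play the role of the
// loop counters i1..i4 of the serial code (i1 is the most significant).
HOST_DEVICE inline int table_entry(unsigned s)
{
    // Same values as Bit4CBitmapLookUp in the question.
    const int lut[16] = {0, 1, 3, 3, 7, 7, 7, 7,
                         15, 15, 15, 15, 15, 15, 15, 15};
    const int a1 = lut[(s >> 12) & 0xF] << 12;  // first customer
    const int a2 = lut[(s >> 8) & 0xF] << 8;    // second customer
    const int a3 = lut[(s >> 4) & 0xF] << 4;    // third customer
    const int a4 = lut[s & 0xF];                // fourth customer
    return a1 | a2 | a3 | a4;
}

#ifdef __CUDACC__
// One thread per table entry. A possible launch, with d_table being a
// device buffer of kTableSize ints:
//   fill_table<<<(kTableSize + 255) / 256, 256>>>(d_table);
__global__ void fill_table(int *table)
{
    const unsigned s = blockDim.x * blockIdx.x + threadIdx.x;
    if (s < kTableSize)
        table[s] = table_entry(s);
}
#endif

// Host reference built from the same per-entry function.
std::vector<int> fill_table_host()
{
    std::vector<int> table(kTableSize);
    for (unsigned s = 0; s < kTableSize; s++)
        table[s] = table_entry(s);
    return table;
}
```

With this layout no thread needs a shared counter: every thread derives its own output position and its four lookups from s alone, which is why a one-dimensional launch is enough here.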
Upvotes: 3