Reputation: 180
Am implementing Sieve of Eratosthenes in CUDA and am having a very weird output. Am using unsigned char* as the data structure and using the following macros to manipulate the bits.
#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;
I set the bit to denote it's a prime number, otherwise it's = 0. Here is where i call my kernel
size_t p=3;
size_t primeTill = 30;
while(p*p<=primeTill)
{
if(ISBITSET(h_a, p) == 1){
int dimA = 30;
int numBlocks = 1;
int numThreadsPerBlock = dimA;
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
cudaThreadSynchronize();
reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
cudaThreadSynchronize();
cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
cudaThreadSynchronize();
printf("This is after removing multiples of %d\n", p);
//Loop
for(size_t i = 0; i < primeTill +1; i++)
{
printf("Bit %d is %d\n", i, ISBITSET(h_a, i));
}
}
p++;
}
Here is my kernel
__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
int id = blockIdx.x*blockDim.x + threadIdx.x;
int r = id*p;
if(id >= p && r <= size )
{
while(ISBITSET(d_out, r ) == 1 ){
CLEARBIT(d_out, r);
}
// if(r == 9)
// {
// /* code */
// CLEARBIT(d_out, 9);
// }
}
} The output should be: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 while my output is: 2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29
If you take a look at the kernel code, if i uncomment those lines i will get the correct answer, which means that there is nothing wrong with my loops or my checking!
Upvotes: 3
Views: 2217
Reputation: 2281
Multiple threads are accessing the same word (char) in global memory simultaneously and thus the written result gets corrupted.
You could use atomic operations to prevent this but the better solution would be to alter your algorithm: Instead of letting every thread sieve out multiples of 2, 3, 4, 5, ... let every thread check a range like [0..7], [8..15], ... so that every range's length is a multiple of 8 bits and no collisions occur.
Upvotes: 1
Reputation: 35453
I would suggest replacing the macros with methods to start with. You can use methods preceded by __host__
and __device__
to generate cpp and cu specific versions where necessary. That will eradicate the possibility of the pre-processor doing something unexpected.
Now just debug the particular code branch that is causing the wrong output, checking that each stage is correct in turn and you'll find the problem.
Upvotes: 1