Bit array in CUDA

Question

Am implementing Sieve of Eratosthenes in CUDA and am having a very weird output. Am using unsigned char* as the data structure and using the following macros to manipulate the bits.

#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;

I set the bit to denote it's a prime number, otherwise it's = 0. Here is where i call my kernel

size_t p=3;
size_t primeTill = 30;

while(p*p<=primeTill)
{
    if(ISBITSET(h_a, p) == 1){
        int dimA = 30;
        int numBlocks = 1;
        int numThreadsPerBlock = dimA;
        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );        
        cudaThreadSynchronize();    
        reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
        cudaThreadSynchronize();    
        cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
        cudaThreadSynchronize();    
        printf("This is after removing multiples of %d
", p);
        //Loop
        for(size_t i = 0; i < primeTill +1; i++)
        {
            printf("Bit %d is %d
", i, ISBITSET(h_a, i));
        }
    }           
    p++;
}

Here is my kernel

__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
int id = blockIdx.x*blockDim.x + threadIdx.x;
int r = id*p;
if(id >= p && r <= size )
{
    while(ISBITSET(d_out, r ) == 1 ){
        CLEARBIT(d_out, r);
    }

    // if(r == 9)
    // {
    //  /* code */
    //  CLEARBIT(d_out, 9);
    // }

}

} The output should be: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 while my output is: 2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29

If you take a look at the kernel code, if i uncomment those lines i will get the correct answer, which means that there is nothing wrong with my loops or my checking!

Dave O. · Accepted Answer

Multiple threads are accessing the same word (char) in global memory simultaneously and thus the written result gets corrupted.

You could use atomic operations to prevent this but the better solution would be to alter your algorithm: Instead of letting every thread sieve out multiples of 2, 3, 4, 5, ... let every thread check a range like [0..7], [8..15], ... so that every range's length is a multiple of 8 bits and no collisions occur.

Bit array in CUDA

Answers (2)

Related Questions