Salem Sayed
Salem Sayed

Reputation: 180

Bit array in CUDA

Am implementing Sieve of Eratosthenes in CUDA and am having a very weird output. Am using unsigned char* as the data structure and using the following macros to manipulate the bits.

#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;

I set the bit to denote it's a prime number, otherwise it's = 0. Here is where i call my kernel

size_t p=3;
size_t primeTill = 30;

while(p*p<=primeTill)
{
    if(ISBITSET(h_a, p) == 1){
        int dimA = 30;
        int numBlocks = 1;
        int numThreadsPerBlock = dimA;
        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );        
        cudaThreadSynchronize();    
        reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
        cudaThreadSynchronize();    
        cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
        cudaThreadSynchronize();    
        printf("This is after removing multiples of %d\n", p);
        //Loop
        for(size_t i = 0; i < primeTill +1; i++)
        {
            printf("Bit %d is %d\n", i, ISBITSET(h_a, i));
        }
    }           
    p++;
}

Here is my kernel

__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
int id = blockIdx.x*blockDim.x + threadIdx.x;
int r = id*p;
if(id >= p && r <= size )
{
    while(ISBITSET(d_out, r ) == 1 ){
        CLEARBIT(d_out, r);
    }

    // if(r == 9)
    // {
    //  /* code */
    //  CLEARBIT(d_out, 9);
    // }

}

} The output should be: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 while my output is: 2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29

If you take a look at the kernel code, if i uncomment those lines i will get the correct answer, which means that there is nothing wrong with my loops or my checking!

Upvotes: 3

Views: 2217

Answers (2)

Dave O.
Dave O.

Reputation: 2281

Multiple threads are accessing the same word (char) in global memory simultaneously and thus the written result gets corrupted.

You could use atomic operations to prevent this but the better solution would be to alter your algorithm: Instead of letting every thread sieve out multiples of 2, 3, 4, 5, ... let every thread check a range like [0..7], [8..15], ... so that every range's length is a multiple of 8 bits and no collisions occur.

Upvotes: 1

Dan
Dan

Reputation: 35453

I would suggest replacing the macros with methods to start with. You can use methods preceded by __host__ and __device__ to generate cpp and cu specific versions where necessary. That will eradicate the possibility of the pre-processor doing something unexpected.

Now just debug the particular code branch that is causing the wrong output, checking that each stage is correct in turn and you'll find the problem.

Upvotes: 1

Related Questions