Optimizing a Bit-Wise Manipulation Kernel

Question

I have the following code which progressively goes through a string of bits and rearrange them into blocks of 20bytes. I'm using 32*8 blocks with 40 threads per block. However the process takes something like 36ms on my GT630M. Are there any further optimization I can do? Especially with regard to removing the if-else in the inner most loop.

__global__ void test(unsigned char *data)
{
    __shared__ unsigned char dataBlock[20];
    __shared__ int count;
    count = 0;

    unsigned char temp = 0x00;

    for(count=0; count<(streamSize/8); count++)
    {
        for(int i=0; i<8; i++)
        {
            if(blockIdx.y >= i)
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<>(blockIdx.y - i);
            else
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<

Optimizing a Bit-Wise Manipulation Kernel

Answers (1)

Related Questions