gamerx

Reputation: 579

Optimizing a Bit-Wise Manipulation Kernel

I have the following code, which steps through a string of bits and rearranges them into blocks of 20 bytes. I'm using 32*8 blocks with 40 threads per block. However, the process takes about 36 ms on my GT630M. Are there any further optimizations I can make, especially with regard to removing the if-else in the innermost loop?

__global__ void test(unsigned char *data)
{
    __shared__ unsigned char dataBlock[20];
    __shared__ int count;
    count = 0;

    unsigned char temp = 0x00;

    for(count=0; count<(streamSize/8); count++)
    {
        for(int i=0; i<8; i++)
        {
            if(blockIdx.y >= i)
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))>>(blockIdx.y - i);
            else
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))<<(i - blockIdx.y); 
        }
        dataBlock[threadIdx.x] = temp;
        // do something

    }

}

Upvotes: 0

Views: 149

Answers (1)

ArchaeaSoftware

Reputation: 4422

It's not clear what your code is trying to accomplish, but a couple of obvious opportunities are:

1) if possible, use 32-bit words instead of unsigned char.

2) use block sizes that are multiples of 32.

3) The conditional code may not be costing you as much as you expect. You can check by compiling with --cubin --gpu-architecture sm_xx (where xx is the SM version of your target hardware) and then running cuobjdump --dump-sass on the resulting cubin file to look at the generated assembly. You may have to modify the source code to hoist the common subexpression into a separate variable, and/or use the ternary operator ? : to hint to the compiler to use predication.
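To make point 3 concrete, here is a minimal sketch of two ways the inner-loop expression could be rewritten. The helper names moveBitTernary and moveBitShift and the HOST_DEVICE macro are made up for illustration; they are not part of the question's code. The first helper hoists the masked byte into a variable and uses the ternary operator, as suggested above. The second is a different technique, a plain shift-and-mask, which relies on the observation that both branches of the original if-else just move bit blockIdx.y of the byte to bit position i, so no conditional is needed at all. Whether either version is actually faster is something only the SASS dump or a profiler can tell you.

```cpp
#include <cstdint>

// Lets the helpers compile both under nvcc and as ordinary host C++.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: same logic as the question's if-else, with the
// masked byte hoisted into a variable and the branch written as ?: .
// Assumes srcBit and dstBit are both in the range 0..7.
HOST_DEVICE inline unsigned moveBitTernary(unsigned char byte,
                                           unsigned srcBit,
                                           unsigned dstBit)
{
    const unsigned masked = byte & (1u << srcBit);  // common subexpression
    return (srcBit >= dstBit) ? (masked >> (srcBit - dstBit))
                              : (masked << (dstBit - srcBit));
}

// Hypothetical helper: branch-free alternative. Shift the wanted bit down
// to position 0, isolate it, then shift it up to the destination.
// Assumes srcBit and dstBit are both in the range 0..7.
HOST_DEVICE inline unsigned moveBitShift(unsigned char byte,
                                         unsigned srcBit,
                                         unsigned dstBit)
{
    return ((byte >> srcBit) & 1u) << dstBit;
}
```

In the kernel from the question, the body of the inner loop would then look something like temp |= moveBitShift(b, blockIdx.y, i); where b is the byte loaded from data for that iteration.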

Upvotes: 1
