Recker
Recker

Reputation: 1975

Cuda kernel - possible optimizations

Here is the kernel that I am launching for calculating some array in parallel.

__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi)
    {       
        for(int j = 0; j < rowsize;j++)
        {           
           for(int k = 0;k < colsize;k++)
            {   
              if(Aj[j] == Bi[k])
               {    
                return true;
                }                               
            }           
        }
            return false;       
    }


__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i;
        if(tid < cols)
        {
            int beg = Bptr[tid];
            int end = Bptr[tid+1];
            for(i = 0;i < rows;i++)
            {
                int cbeg = Aptr[i];
                int cend = Aptr[i+1];
                if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
                {                                                
                     Cjc[tid+1] += 1;
                     //atomicAdd(Cjc+tid+1,1);           
                }
            }                
        }               
    }

My launch configurations and kernel call are as follows.

int numBlocks,numThreads;

        if(q % 32 == 0)
        {
            numBlocks = q/32;
            numThreads = 32;
        }
        else
        {
            numBlocks = (q+31)/32;
            numThreads = 32;
        }
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);

I have to admit, this kernel is running pretty slow.Once I get the array back to host side, I use thrust::inclusive_scan to find my resultant array. My question is, is there any room for improvement / optimization for my kernel? I tried using shared memory but its producing either wrong answers or throwing runtime exceptions.

Also, how does the dynamically allocated shared memory ( which is allocated by third parameter in kernel launch ) is distributed among the blocks?

Any help/hints/insinuations will be appreciated. Thanks in advance.

Upvotes: 0

Views: 159

Answers (1)

Programmer
Programmer

Reputation: 6753

As for the shared memory allocated using kernel<<<blocks,threads,mem>>> mem is the amount of memory allocated each block. So each block gets mem amount of memory.

For your code, I don't understand why are there 2 for loops in the mult function. Just want to point out that each thread will be executing these 2 for loops. Moreover, as you also have a for loop in the kernel function, it means that each thread will be executing the 2 for loops in the mult function several times. THis is slow. Moreover, doing

int beg = Bptr[tid]; 
int end = Bptr[tid+1]; 

is not exactly coalesced access. Non coalesced access is slow.

Upvotes: 1

Related Questions