Darkmoor

Reputation: 902

Computing the Euclidean distances between corresponding rows of matrices with CUDA

I have a very simple algorithm that computes the squared Euclidean distances between the corresponding rows of two matrices. I have the following code, but unfortunately it does not return the correct results for all matrix sizes. More specifically, it works fine for matrices of size 2000x4, 500x4, 2500x2, 600x8, 1000x8, and 100x8, but it does not work for matrices of size 2500x3, 2500x5, 400x3, 100x3, 100x10, 1000x10, 1000x12, 500x12, and 500x14.

Can anybody help me? I want to do it manually, without using any optimized library, because I want to understand thread management.

__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
    {
        int i, squareeucldist = 0;
        int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
        int c = blockDim.y * blockIdx.y + threadIdx.y; // cols 
        extern __shared__ float sdata[];
        //int r = blockIdx.y; int c = threadIdx.x;
        if( r < rows && c < cols  ){

            //C[r + rows*c] = ( A[r + rows*c] - B[r + rows*c] ) * ( A[r + rows*c] - B[r + rows*c] );


            sdata[threadIdx.x] = ( A[r + rows*c] - B[r + rows*c] ) * ( A[r + rows*c] - B[r + rows*c] );

            __syncthreads();

            // contiguous range pattern
            for(int offset = blockDim.x / 2;
                offset > 0;
                offset >>= 1)
            {
                if(threadIdx.x < offset)
                {
                    // add a partial sum upstream to our own
                    sdata[threadIdx.x] += sdata[threadIdx.x + offset];
                }

                // wait until all threads in the block have
                // updated their partial sums
                __syncthreads();
            }

            // thread 0 writes the final result
            if(threadIdx.x == 0)
            {
                C[r] = sdata[0];
            }

        }

    }

The kernel call is:

dim3 dimBlock( cols, 1 ); 
dim3 dimGrid( 1, rows ); 
cudaEuclid<<<dimGrid, cols, cols*sizeof(float)>>>( d_A, d_B, d_C, rows, cols );

PS: I want to mention that I posted a similar question earlier, but it was unclear from the beginning and the discussion became disoriented. Even though Tom made a very useful suggestion that will be very practical for optimized implementations in the future, I need something more handmade. Finally, I made this post because I do not want to make the related post more complicated. Thanks.

Upvotes: 3

Views: 2403

Answers (2)

Vitality

Reputation: 21515

Although the OP does not want to use optimized libraries to solve his problem, this post has a useful title, and other users may find it helpful to solve the problem without handwritten kernels.

I was curious and played a bit with the problem, with CUDA Thrust in mind. I ended up with the code below, which calculates the distances between homologous rows of two matrices using thrust::reduce_by_key.

#include <cstdio>
#include <iostream>

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/sequence.h>
#include <thrust/random.h>
#include <thrust/gather.h>
#include <thrust/extrema.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/reduce.h>

using namespace thrust::placeholders;

/****************************************************/
/* POWER DIFFERENCE FUNCTOR FOR EUCLIDEAN DISTANCES */
/****************************************************/
struct PowerDifference {
    __host__ __device__ float operator()(const float& a, const float& b) const { return pow(a - b, 2); }
};

/*******************/
/* EXPAND OPERATOR */
/*******************/
template <typename InputIterator1, typename InputIterator2, typename OutputIterator>
OutputIterator expand(InputIterator1 first1,
                      InputIterator1 last1,
                      InputIterator2 first2,
                      OutputIterator output)
{
    typedef typename thrust::iterator_difference<InputIterator1>::type difference_type;

    difference_type input_size  = thrust::distance(first1, last1);
    difference_type output_size = thrust::reduce(first1, last1);

    // scan the counts to obtain output offsets for each input element
    thrust::device_vector<difference_type> output_offsets(input_size, 0);
    thrust::exclusive_scan(first1, last1, output_offsets.begin()); 

    // scatter the nonzero counts into their corresponding output positions
    thrust::device_vector<difference_type> output_indices(output_size, 0);
    thrust::scatter_if(thrust::counting_iterator<difference_type>(0), thrust::counting_iterator<difference_type>(input_size),
                       output_offsets.begin(), first1, output_indices.begin());

    // compute max-scan over the output indices, filling in the holes
    thrust::inclusive_scan(output_indices.begin(), output_indices.end(), output_indices.begin(), thrust::maximum<difference_type>());

    // gather input values according to index array (output = first2[output_indices])
    OutputIterator output_end = output; thrust::advance(output_end, output_size);
    thrust::gather(output_indices.begin(), output_indices.end(), first2, output);

    // return output + output_size
    thrust::advance(output, output_size);

    return output;
}

/********/
/* MAIN */
/********/
int main()
{
    /**************************/
    /* SETTING UP THE PROBLEM */
    /**************************/

    const int N     = 10;           // --- Number of vector elements
    const int Nvec  = 20;           // --- Number of vectors for each matrix

    // --- Random uniform integer distribution between 0 and 20
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(0, 20);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix1(Nvec * N);
    thrust::device_vector<float> d_matrix2(Nvec * N);
    for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng);
    for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng);

    printf("\n\nFirst matrix\n");
    for(int i = 0; i < Nvec; i++) {
        std::cout << " [ ";
        for(int j = 0; j < N; j++)
            std::cout << d_matrix1[i * N + j] << " ";
        std::cout << "]\n";
    }

    printf("\n\nSecond matrix\n");
    for(int i = 0; i < Nvec; i++) {
        std::cout << " [ ";
        for(int j = 0; j < N; j++)
            std::cout << d_matrix2[i * N + j] << " ";
        std::cout << "]\n";
    }

    /****************************************************************************/
    /* CALCULATING THE EUCLIDEAN DISTANCES BETWEEN THE ROWS OF THE TWO MATRICES */
    /****************************************************************************/
    // --- Creating the indices for the reduction by key
    thrust::device_vector<int> d_sequence(Nvec);
    thrust::device_vector<int> d_indices(Nvec * N);
    thrust::device_vector<int> d_counts(Nvec, N);
    thrust::sequence(d_sequence.begin(), d_sequence.begin() + Nvec);
    expand(d_counts.begin(), d_counts.end(), d_sequence.begin(), d_indices.begin());

    printf("\n\nIndices\n");
    for(int i = 0; i < Nvec; i++) {
        std::cout << " [ ";
        for(int j = 0; j < N; j++)
            std::cout << d_indices[i * N + j] << " ";
        std::cout << "]\n";
    }

    thrust::device_vector<float> d_squared_differences(Nvec * N);

    thrust::transform(d_matrix1.begin(), d_matrix1.end(), d_matrix2.begin(), d_squared_differences.begin(), PowerDifference());

    thrust::device_vector<float> d_norms(Nvec);
    thrust::reduce_by_key(d_indices.begin(), d_indices.end(), d_squared_differences.begin(), d_indices.begin(), d_norms.begin());

    printf("\n\ndnorms\n");
    for(int i = 0; i < Nvec; i++) {
            std::cout << d_norms[i] << " ";
    }

    return 0; 
}

Upvotes: 1

kangshiyin

Reputation: 9779

In fact, your code works only for matrices of size m x 2^n, with n small enough. You may want to read page 14 of the following slides more carefully:

http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

and think about the following questions

  1. What will happen when your blockDim.x is equal to 3 or 5?
  2. How can the parallel reduction be done correctly when blockDim.x or cols is not a power of 2?
  3. Why is the reduction result smaller than expected?
  4. Which element(s) in sdata[] are not added to the final sum?
  5. Will the result be correct if you set blockDim.x and the size of smem to 2^3 when cols is 5?
  6. In the case of question 5, how do you deal with the extra 3-element space in smem[5..7]?

Simulating the for loop step by step with pen and paper will help.

Upvotes: 1
