Darkmoor

Reputation: 902

Sub-Matrix computations

I want to calculate the pairwise distance between two sub-matrices of a matrix. For example, I have a matrix A (MxN) and two blocks of that matrix, B1 (mxn) and B2 (kxt). More specifically, I want to calculate the distance of the B1(1,1) element from every element of B2, and to repeat this process for all the elements of B1. To be clear, B1 and B2 may not be compact parts of the matrix; basically, the information I know is the coordinates of the elements of B1 and B2 within the matrix A. Here is an example.

float sum = 0.0f;

for(int i = 0; i < nRowsCoordsB1 ; i++ ){ // rows of CoordsofB1 (elements of B1)
  for(int j = 0; j < nRowsCoordsB2 ; j++ ){ // rows of CoordsofB2 (elements of B2)

    // CoordsofB1 is an nRowsCoordsB1 x 2 matrix (column-major) that contains the
    // coordinates of the B1 elements inside A; CoordsofB2 has the same layout.

    int a_x = CoordsofB1[ i ];                   // x coord of the corresponding row i
    int a_y = CoordsofB1[ i + nRowsCoordsB1 ];   // y coord of the corresponding row

    int b_x = CoordsofB2[ j ];
    int b_y = CoordsofB2[ j + nRowsCoordsB2 ];

    int element1 = A[ a_x + a_y*nRowsofA ];      // A is stored column-major
    int element2 = A[ b_x + b_y*nRowsofA ];
    sum += abs( element1 - element2 );

  }
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

Now I want to speed up the computation with CUDA :) Because I am new to CUDA, I find it a little complicated. So far I think I have understood the logic of allocating thread blocks at the matrix level, but here the fact that I have two different parts of the matrix with different sizes, CoordsofB1 and CoordsofB2, confuses me a little as to how I can access them, take the coordinates and use them in the whole matrix. I thought that we should work on A using constraints, but I did not come up with a clear idea.

Also, the fact that at the end of the for loops the sum is divided by a quantity confuses me as to how it should be combined in the translated CUDA code.

Any suggestions, snippets, examples or references would be great.

PS: the reason I use column-major ordering is that the code is evaluated from MATLAB.

UPDATE: Can we allocate a thread block with size equal to the size of the biggest sub-matrix (B1 or B2) and work with it using the right conditions? I commented out the last line because I was not sure what to do with it. Any comments?

int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){       

  a_x = CoordsofB1[ r ]; 
  a_y = CoordsofB1[ r + nRowsCoordsB1 ]; 
  if( r < nRowsCoordsB2 ){

    b_x = CoordsofB2[ r ];
    b_y = CoordsofB2[ r + nRowsCoordsB2 ];
    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2); 

Here is a sketch of the sub-matrices B1 and B2 inside A:

I have the coordinates of each element inside B1 and B2, and I want to calculate the differences between the values:

[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,:)) ] +

[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,:)) ] +

[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,:)) ].

Upvotes: 0

Views: 531

Answers (2)

Vitality

Reputation: 21455

Perhaps the solution below, using a 2D thread grid, could be an alternative to Eric's (kangshiyin's) use of thrust and give some more insight into the problem.

The code snippet below is meant to illustrate the concept only; it is untested code.

2D grid

Define a partial_distances matrix of size nRowsCoordsB1 x nRowsCoordsB2 that will contain all the involved absolute value differences between the elements of B1 and B2. The kernel and the main function would then look like this:

__global__ void distance_calculator(const int* A, int nRowsofA, int* partial_distances, const int* CoordsofB1, const int* CoordsofB2, int nRowsCoordsB1, int nRowsCoordsB2) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    // The grid is rounded up, so discard out-of-range threads.
    if (i >= nRowsCoordsB1 || j >= nRowsCoordsB2) return;

    int a_x = CoordsofB1[i];
    int a_y = CoordsofB1[i+nRowsCoordsB1];

    int b_x = CoordsofB2[j];
    int b_y = CoordsofB2[j+nRowsCoordsB2];

    partial_distances[j*nRowsCoordsB1+i] = abs(A[a_x+a_y*nRowsofA]-A[b_x+b_y*nRowsofA]);

}

int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

int main() {

    int* partial_distances; cudaMalloc((void**)&partial_distances,nRowsCoordsB1*nRowsCoordsB2*sizeof(int));

    dim3 BlockSize(BLOCKSIZE,BLOCKSIZE);
    dim3 GridSize;
    GridSize.x = iDivUp(nRowsCoordsB1,BLOCKSIZE);
    GridSize.y = iDivUp(nRowsCoordsB2,BLOCKSIZE);

    distance_calculator<<<GridSize,BlockSize>>>(A,nRowsofA,partial_distances,CoordsofB1,CoordsofB2,nRowsCoordsB1,nRowsCoordsB2);

   REDUCTION_STEP

}

The REDUCTION_STEP could be implemented as an iterative call to a 1D reduction kernel to sum up all the elements corresponding to a particular element of B1.
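
For example, here is a minimal sketch of that reduction, assuming the kernel above has filled partial_distances; thrust::reduce over the raw device pointer replaces a hand-written reduction kernel, and the final normalization (the commented-out last line of the question) is done on the host:

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// ... after the distance_calculator launch ...
thrust::device_ptr<int> dp(partial_distances);                   // wrap the raw device pointer
long long total = thrust::reduce(dp, dp + nRowsCoordsB1 * nRowsCoordsB2,
                                 (long long)0);                  // sum of all |B1 - B2| differences
float output = total / (float)(nRowsCoordsB1 * nRowsCoordsB2);   // mean absolute difference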

An alternative would be to use dynamic parallelism to call the reduction routine directly from within the kernel, but this option is not suitable for the card you are using.

Upvotes: 1

kangshiyin

Reputation: 9771

If I understand it correctly, what you are trying to do can be written as the following MATLAB code.

rep_B1 = repmat(B1(:),  1, length(B2(:)));
rep_B2 = repmat(B2(:)', length(B1(:)), 1);
absdiff_B1B2 = abs(rep_B1 - rep_B2);
Result = mean(absdiff_B1B2(:));

You will notice that before the reduction there is a matrix absdiff_B1B2 of size length(B1(:)) x length(B2(:)), i.e. m*n x k*t (this matrix is never stored to global memory if you implement the above code in one CUDA kernel). You could partition this matrix into 16x16 sub-matrices and use one 256-thread block per sub-matrix to decompose the workload for the GPU, as sketched below.
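
For instance, here is a minimal sketch of that tiling scheme, assuming B1 and B2 have already been gathered into dense device arrays d_B1 and d_B2 of lengths lenB1 = m*n and lenB2 = k*t (these names are illustrative, not from the question). Each 16x16 block covers one tile of the virtual absdiff_B1B2 matrix, reduces it in shared memory and adds its contribution to a single global accumulator, so the full matrix never touches global memory:

#define TILE 16

__global__ void absdiff_tile_sum(const float* d_B1, const float* d_B2,
                                 int lenB1, int lenB2, float* d_sum)
{
    __shared__ float partial[TILE * TILE];

    int i   = blockIdx.x * TILE + threadIdx.x;  // linear index into B1(:)
    int j   = blockIdx.y * TILE + threadIdx.y;  // linear index into B2(:)
    int tid = threadIdx.y * TILE + threadIdx.x;

    // One element of the virtual absdiff_B1B2 matrix per thread (0 outside the bounds).
    partial[tid] = (i < lenB1 && j < lenB2) ? fabsf(d_B1[i] - d_B2[j]) : 0.0f;
    __syncthreads();

    // Tree reduction over the 256 values of this tile.
    for (int s = (TILE * TILE) / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomic per block accumulates the tile's partial sum.
    if (tid == 0) atomicAdd(d_sum, partial[0]);
}

The kernel would be launched with 16x16 thread blocks on a grid that covers lenB1 x lenB2 (rounded up); note that atomicAdd on float requires compute capability 2.0 or higher.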

On the other hand, you could use thrust to make your life easier.

Update

Since B1 and B2 are sub-matrices of A, you could first use cudaMemcpy2D() to copy them to linear space, then use a kernel to construct and then reduce the matrix absdiff_B1B2.
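
As a minimal sketch of that copy, assume B1 is a rectangular m x n block of the column-major A (leading dimension nRowsofA) whose top-left element sits at (row0, col0), and that d_A and d_B1 are device pointers (all names here are illustrative). With column-major storage each column of the block is contiguous, so a column plays the role of a "row" in cudaMemcpy2D terms:

// d_A : device copy of A (column-major, leading dimension nRowsofA)
// d_B1: device buffer of m*n ints that will hold B1 in dense column-major order
cudaMemcpy2D(d_B1,                           // dst
             m * sizeof(int),                // dst pitch: one column of B1
             d_A + row0 + col0 * nRowsofA,   // src: first element of the block
             nRowsofA * sizeof(int),         // src pitch: one column of A
             m * sizeof(int),                // width in bytes: m contiguous elements per column
             n,                              // height: number of columns to copy
             cudaMemcpyDeviceToDevice);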

For the final normalization operation (the last line of your code), you could do it on the CPU, for example:
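
Continuing the illustrative names from the sketch above (a single-float accumulator d_sum on the device, and block sizes m, n, k, t):

float h_sum = 0.0f;
cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);   // bring the reduced sum back
double Result = h_sum / ((double)(m * n) * (double)(k * t));        // mean over all m*n*k*t pairs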

Here's the code using thrust to show how to construct and reduce the matrix absdiff_B1B2 in a single kernel. However, you will find that the construction procedure uses no shared memory and is not optimized. Further optimization using shared memory would improve the performance.

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>                     // thrust::plus, thrust::placeholders
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

template<typename T>
struct abs_diff
{
    inline __host__ __device__ T operator()(const T& x, const T& y)
    {
        return abs(x - y);
    }
};

int main()
{
    using namespace thrust::placeholders;

    const int m = 98;
    const int n = 87;
    int k = 76;
    int t = 65;
    double result;

    thrust::device_vector<double> B1(m * n, 1.0);
    thrust::device_vector<double> B2(k * t, 2.0);

    // Enumerate all (m*n) x (k*t) pairs without materializing them: the first
    // permutation iterator cycles through B1 (index % (m*n)), the second steps
    // through B2 (index / (m*n)); abs_diff is applied to each pair and the
    // results are summed.
    result = thrust::inner_product(
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))),
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))) + (m * n * k * t),
            thrust::make_permutation_iterator(
                    B2.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 / (m * n))),
            0.0,
            thrust::plus<double>(),
            abs_diff<double>());
    result /= m * n * k * t;

    std::cout << result << std::endl;

    return 0;
}

Upvotes: 1
