Darkmoor

Reputation: 902

Sub-Matrix computations

I want to calculate the pairwise distance between two sub-matrices of a matrix. For example, I have a matrix A (MxN) and two blocks of that matrix, B1 (mxn) and B2 (kxt). More specifically, I want to calculate the distance of the B1(1,1) element from every element of B2, and to repeat this process for all the elements of B1. To be clear, B1 and B2 may not be compact parts of the matrix; basically, the information I know is the coordinates of the elements of B1 and B2 within the matrix A. Here is an example.

float sum = 0.0f;

for(int i = 0; i < nRowsCoordsB1 ; i++ ){ // rows of CoordsofB1 (elements of B1)
  for(int j = 0; j < nRowsCoordsB2 ; j++ ){ // rows of CoordsofB2 (elements of B2)

    // CoordsofB1 is an nRowsCoordsB1 x 2 matrix (column-major) that contains the
    // coordinates of the B1 elements inside A; CoordsofB2 has the same layout.

    int a_x = CoordsofB1[ i ];                   // x coord of the corresponding row i
    int a_y = CoordsofB1[ i + nRowsCoordsB1 ];   // y coord of the corresponding row

    int b_x = CoordsofB2[ j ];
    int b_y = CoordsofB2[ j + nRowsCoordsB2 ];

    int element1 = A[ a_x + a_y*nRowsofA ];      // A is stored column-major
    int element2 = A[ b_x + b_y*nRowsofA ];
    sum += abs( element1 - element2 );

  }
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

Now I want to speed up the computation with CUDA :) Because I am new to CUDA, I find it a little complicated. So far I think I have understood the logic of allocating thread blocks at the matrix level, but here the fact that I have two different parts of the matrix with different sizes, CoordsofB1 and CoordsofB2, confuses me a little as to how I can access them, take the coordinates and use them in the whole matrix. I thought that we should work on A using constraints, but I did not come up with a clear idea.

Also, the fact that at the end of the for loops the sum is divided by a quantity confuses me as to how it should be combined in the translated CUDA code.

Any suggestions, snippets, examples or references would be great.

PS: the reason I use column-major ordering is that the code is evaluated from MATLAB.

UPDATE: Can we allocate a thread block with size equal to the size of the biggest sub-matrix (B1 or B2) and work with it using the right conditions? I commented out the last line because I was not sure what to do with it. Any comments?

int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){       

  a_x = CoordsofB1[ r ]; 
  a_y = CoordsofB1[ r + nRowsCoordsB1 ]; 
  if( r < nRowsCoordsB2 ){

    b_x = CoordsofB2[ r ];
    b_y = CoordsofB2[ r + nRowsCoordsB2 ];
    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2); 

Here is a sketch of the sub-matrices B1 and B2 inside A:

I have the coordinates of each element inside B1 and B2, and I want to calculate the differences between the values:

[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,:)) ] +

[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,:)) ] +

[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,:)) ].

Upvotes: 0

Views: 531

Answers (2)

Vitality

Reputation: 21455

Perhaps the solution below, using a 2D thread grid, could be an alternative to Eric's (kangshiyin's) use of thrust and give some more insight into the problem.

The code snippet below is meant to illustrate the concept only; it is untested code.

2D grid

Define a partial_distances matrix of size nRowsCoordsB1 x nRowsCoordsB2 that will contain all the involved absolute value differences between the elements of B1 and B2. The kernel and the main function would then look like this:

__global__ void distance_calculator(const int* A, int nRowsofA, int* partial_distances, const int* CoordsofB1, const int* CoordsofB2, int nRowsCoordsB1, int nRowsCoordsB2) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    // The grid is rounded up, so discard out-of-range threads.
    if (i >= nRowsCoordsB1 || j >= nRowsCoordsB2) return;

    int a_x = CoordsofB1[i];
    int a_y = CoordsofB1[i+nRowsCoordsB1];

    int b_x = CoordsofB2[j];
    int b_y = CoordsofB2[j+nRowsCoordsB2];

    partial_distances[j*nRowsCoordsB1+i] = abs(A[a_x+a_y*nRowsofA]-A[b_x+b_y*nRowsofA]);

}

int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

int main() {

    int* partial_distances; cudaMalloc((void**)&partial_distances,nRowsCoordsB1*nRowsCoordsB2*sizeof(int));

    dim3 BlockSize(BLOCKSIZE,BLOCKSIZE);
    dim3 GridSize;
    GridSize.x = iDivUp(nRowsCoordsB1,BLOCKSIZE);
    GridSize.y = iDivUp(nRowsCoordsB2,BLOCKSIZE);

    distance_calculator<<<GridSize,BlockSize>>>(A,nRowsofA,partial_distances,CoordsofB1,CoordsofB2,nRowsCoordsB1,nRowsCoordsB2);

   REDUCTION_STEP

}

The REDUCTION_STEP could be implemented as an iterative call to a 1D reduction kernel to sum up all the elements corresponding to a particular element of B1.
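
For example, here is a minimal sketch of that reduction, assuming the kernel above has filled partial_distances; thrust::reduce over the raw device pointer replaces a hand-written reduction kernel, and the final normalization (the commented-out last line of the question) is done on the host:

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// ... after the distance_calculator launch ...
thrust::device_ptr<int> dp(partial_distances);                   // wrap the raw device pointer
long long total = thrust::reduce(dp, dp + nRowsCoordsB1 * nRowsCoordsB2,
                                 (long long)0);                  // sum of all |B1 - B2| differences
float output = total / (float)(nRowsCoordsB1 * nRowsCoordsB2);   // mean absolute difference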

An alternative would be to use dynamic parallelism to call the reduction routine directly from within the kernel, but this option is not suitable for the card you are using.

Upvotes: 1

kangshiyin

Reputation: 9771

If I understand it correctly, what you are trying to do can be written as the following MATLAB code.

rep_B1 = repmat(B1(:),  1, length(B2(:)));
rep_B2 = repmat(B2(:)', length(B1(:)), 1);
absdiff_B1B2 = abs(rep_B1 - rep_B2);
Result = mean(absdiff_B1B2(:));

You will notice that before the reduction there is a matrix absdiff_B1B2 of size length(B1(:)) x length(B2(:)), i.e. m*n x k*t (this matrix is never stored to global memory if you implement the above code in one CUDA kernel). You could partition this matrix into 16x16 sub-matrices and use one 256-thread block per sub-matrix to decompose the workload for the GPU, as sketched below.
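
For instance, here is a minimal sketch of that tiling scheme, assuming B1 and B2 have already been gathered into dense device arrays d_B1 and d_B2 of lengths lenB1 = m*n and lenB2 = k*t (these names are illustrative, not from the question). Each 16x16 block covers one tile of the virtual absdiff_B1B2 matrix, reduces it in shared memory and adds its contribution to a single global accumulator, so the full matrix never touches global memory:

#define TILE 16

__global__ void absdiff_tile_sum(const float* d_B1, const float* d_B2,
                                 int lenB1, int lenB2, float* d_sum)
{
    __shared__ float partial[TILE * TILE];

    int i   = blockIdx.x * TILE + threadIdx.x;  // linear index into B1(:)
    int j   = blockIdx.y * TILE + threadIdx.y;  // linear index into B2(:)
    int tid = threadIdx.y * TILE + threadIdx.x;

    // One element of the virtual absdiff_B1B2 matrix per thread (0 outside the bounds).
    partial[tid] = (i < lenB1 && j < lenB2) ? fabsf(d_B1[i] - d_B2[j]) : 0.0f;
    __syncthreads();

    // Tree reduction over the 256 values of this tile.
    for (int s = (TILE * TILE) / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomic per block accumulates the tile's partial sum.
    if (tid == 0) atomicAdd(d_sum, partial[0]);
}

The kernel would be launched with 16x16 thread blocks on a grid that covers lenB1 x lenB2 (rounded up); note that atomicAdd on float requires compute capability 2.0 or higher.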

On the other hand, you could use thrust to make your life easier.

Update

Since B1 and B2 are sub-matrices of A, you could first use cudaMemcpy2D() to copy them to linear space, then use a kernel to construct and then reduce the matrix absdiff_B1B2.
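
As a minimal sketch of that copy, assume B1 is a rectangular m x n block of the column-major A (leading dimension nRowsofA) whose top-left element sits at (row0, col0), and that d_A and d_B1 are device pointers (all names here are illustrative). With column-major storage each column of the block is contiguous, so a column plays the role of a "row" in cudaMemcpy2D terms:

// d_A : device copy of A (column-major, leading dimension nRowsofA)
// d_B1: device buffer of m*n ints that will hold B1 in dense column-major order
cudaMemcpy2D(d_B1,                           // dst
             m * sizeof(int),                // dst pitch: one column of B1
             d_A + row0 + col0 * nRowsofA,   // src: first element of the block
             nRowsofA * sizeof(int),         // src pitch: one column of A
             m * sizeof(int),                // width in bytes: m contiguous elements per column
             n,                              // height: number of columns to copy
             cudaMemcpyDeviceToDevice);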

For the final normalization operation (the last line of your code), you could do it on the CPU, for example:
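
Continuing the illustrative names from the sketch above (a single-float accumulator d_sum on the device, and block sizes m, n, k, t):

float h_sum = 0.0f;
cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);   // bring the reduced sum back
double Result = h_sum / ((double)(m * n) * (double)(k * t));        // mean over all m*n*k*t pairs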

Here's the code using thrust to show how to construct and reduce the matrix absdiff_B1B2 in a single kernel. However, you will find that the construction procedure uses no shared memory and is not optimized. Further optimization using shared memory would improve the performance.

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>                     // thrust::plus, thrust::placeholders
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

template<typename T>
struct abs_diff
{
    inline __host__ __device__ T operator()(const T& x, const T& y)
    {
        return abs(x - y);
    }
};

int main()
{
    using namespace thrust::placeholders;

    const int m = 98;
    const int n = 87;
    int k = 76;
    int t = 65;
    double result;

    thrust::device_vector<double> B1(m * n, 1.0);
    thrust::device_vector<double> B2(k * t, 2.0);

    // Enumerate all (m*n) x (k*t) pairs without materializing them: the first
    // permutation iterator cycles through B1 (index % (m*n)), the second steps
    // through B2 (index / (m*n)); abs_diff is applied to each pair and the
    // results are summed.
    result = thrust::inner_product(
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))),
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))) + (m * n * k * t),
            thrust::make_permutation_iterator(
                    B2.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 / (m * n))),
            0.0,
            thrust::plus<double>(),
            abs_diff<double>());
    result /= m * n * k * t;

    std::cout << result << std::endl;

    return 0;
}

Upvotes: 1
