guerrillacodester

Reputation: 531

Best way to achieve CUDA Vector Diagonalization

What I want to do is feed in my m x n matrix and, in parallel, construct a square diagonal matrix for each of its n columns, perform an operation on each diagonal matrix, and then recombine the results. How do I do this?

So far, I start off with an m x n matrix, the result of a previous matrix computation where each element is calculated using the function y = f(g(x)).

This gives me a matrix with n columns [f1, f2, ..., fn], where each fn represents a column vector of height m.

From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) w.r.t. g(x) results in a square matrix with elements f'(x). Under the constraints of the problem, this square matrix reduces to a Jacobian with the elements of fn' along the diagonal and all other elements equal to zero.

Hence the need to construct a diagonal matrix for each of the column vectors fn.
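
In other words, writing x_1, ..., x_m for the m entries of a column (my notation here, just to make the structure explicit), the Jacobian I am after is

\frac{\partial f_n}{\partial g} = \operatorname{diag}\big(f_n'(x_1), \dots, f_n'(x_m)\big),
\qquad
J_{ij} =
\begin{cases}
f_n'(x_i) & \text{if } i = j \\
0 & \text{otherwise}
\end{cases}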

To do this, I take a target vector defined as A(hA x 1), which was extracted from the larger A(m x n) matrix. I then prepare a zeroed matrix defined as C(hA x hA), which will be used to hold the diagonal.

The aim is to diagonalize the vector A into a square matrix with each element of A sitting on the diagonal of C, everything else being zero.
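
Concretely, the mapping is just the following (a host-side reference loop for illustration, assuming C is stored in row-major order and zero-initialized):

for (int i = 0; i < hA; ++i)
    C[i * hA + i] = A[i];   // element i of A lands at row i, column i of C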

There are probably more efficient ways to accomplish this using some pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.

The kernel code (which works) to accomplish this is shown here:

_cudaDiagonalizeTest<<<5, 1>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);

__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    int ix, iy, idx;

    // global thread coordinates
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;

    // linear index into the column vector A
    idx = iy * wA + ix;

    // in a row-major matrix of width wC, consecutive diagonal elements are wC + 1 apart
    C[idx * (wC + 1)] = A[idx];
}

I suspect that this is a very naive approach, and I was wondering if someone could give an example of how I could do the same using:

a) reduction

b) thrust

For vectors of large row size, I would like to be able to use the GPU's multithreading capabilities to chunk the task into small jobs, and combine each result at the end with __syncthreads().
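
As a rough, untested sketch of the kind of decomposition I have in mind (hypothetical names, assuming A is stored row-major and the n square blocks are stored back to back in a zero-initialized buffer D):

__global__ void diagonalizeColumns(const float *A, float *D, int m, int n)
{
    int total = m * n;

    // grid-stride loop: any <<<blocks, threads>>> configuration covers the whole matrix
    for (int t = blockIdx.x * blockDim.x + threadIdx.x;
         t < total;
         t += gridDim.x * blockDim.x)
    {
        int row = t / n;   // row of A (row-major, width n)
        int col = t % n;   // column of A

        // diagonal element 'row' of the col-th m x m block of D
        D[col * m * m + row * m + row] = A[t];
    }
}

launched as, for example, diagonalizeColumns<<<64, 256>>>(d_A, d_D, m, n). Since each output element is written by exactly one thread, I assume no cross-thread combining is actually required for this part.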

The picture below shows what the desired result is.

I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.

Any assistance or explanation would be very much welcomed.

Thanks.

[image: matrix A and the desired diagonalized result B]

Matrix A is the target with 4 columns. I want to take each column, and copy its elements into Matrix B as a diagonal, iterating through each column.

Upvotes: 1

Views: 1593

Answers (2)

guerrillacodester

Reputation: 531

An alternate answer that does not use thrust is as follows:

_cudaMatrixTest<<<5, 5>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);

__global__ void _cudaMatrixTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    int ix, iy, idx;

    // global thread coordinates
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;

    // linear thread index
    idx = iy * wA + ix;

    // copy one element of a column of A onto the diagonal of that column's square block in C
    C[idx * wC + (idx % wC)] = A[threadIdx.x * wA + (ix / wC)];
}

where d_A is

0    5    10    15    
1    6    11    16    
2    7    12    17    
3    8    13    18    
4    9    14    19    

Both answers are viable solutions. The question is, which is better/faster?

Upvotes: -1

m.s.

Reputation: 16344

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts.

Another approach could be based on the thrust strided_range example; a sketch of an iterator-based variant is included after the example output below.

This example does what you want (fill the diagonals based on the input vector). However, depending on how you proceed with the resulting matrix in your "Differentiating" step, it might still be worth investigating whether a sparse storage format (without all the zero entries) is possible, since this would reduce memory consumption and make iteration easier.

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>


template<typename V>
void print_matrix(const V& mat, int rows, int cols)
{
   for(int i = 0; i < rows; ++i)
   {
     for(int j = 0; j < cols; ++j)
     {
      std::cout << mat[i + j*rows] << "\t";
     }
     std::cout << std::endl;
   }
}

struct diag_index : public thrust::unary_function<int,int>
{
  diag_index(int rows) : rows(rows){}

  __host__ __device__
  int operator()(const int index) const
  {
      return (index*rows + (index%rows));
  }

  const int rows;
};

int main()
{
  const int rows = 5; 
  const int cols = 4;

  // allocate memory and fill with demo data
  // we use column-major order
  thrust::device_vector<int> A(rows*cols);
  thrust::sequence(A.begin(), A.end());

  thrust::device_vector<int> B(rows*rows*cols, 0);

  // fill diagonal matrix
  thrust::scatter(A.begin(), A.end(), thrust::make_transform_iterator(thrust::make_counting_iterator(0),diag_index(rows)), B.begin());

  print_matrix(A, rows, cols);
  std::cout << std::endl;
  print_matrix(B, rows, rows*cols);
  return 0;
}

This example will output:

0    5    10    15    
1    6    11    16    
2    7    12    17    
3    8    13    18    
4    9    14    19    

0    0    0    0    0    5    0    0    0    0    10    0    0    0    0    15    0    0    0    0    
0    1    0    0    0    0    6    0    0    0    0    11    0    0    0    0    16    0    0    0    
0    0    2    0    0    0    0    7    0    0    0    0    12    0    0    0    0    17    0    0    
0    0    0    3    0    0    0    0    8    0    0    0    0    13    0    0    0    0    18    0    
0    0    0    0    4    0    0    0    0    9    0    0    0    0    14    0    0    0    0    19    
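
For completeness, here is a sketch of the iterator-based variant hinted at above (untested here, and equivalent in effect rather than faster): instead of thrust::scatter, the same diag_index mapping can drive a permutation_iterator, and A is copied through that permuted view of B.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>

// same index mapping as diag_index in the example above
struct diag_index : public thrust::unary_function<int,int>
{
  diag_index(int rows) : rows(rows){}

  __host__ __device__
  int operator()(const int index) const
  {
      return (index*rows + (index%rows));
  }

  const int rows;
};

int main()
{
  const int rows = 5;
  const int cols = 4;

  thrust::device_vector<int> A(rows*cols);
  thrust::sequence(A.begin(), A.end());

  thrust::device_vector<int> B(rows*rows*cols, 0);

  // write A through a permuted view of B instead of scattering into B
  thrust::copy(A.begin(), A.end(),
               thrust::make_permutation_iterator(
                   B.begin(),
                   thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                   diag_index(rows))));
  return 0;
}

The resulting B is identical to the scatter version above.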

Upvotes: 2
