grube

Reputation: 11

CUDA in-place matrix transpose for NxN matrices that iterates over upper triangular entries only

I am working on an in-place matrix-transpose implementation with the following idea: let the indices take only values that correspond to upper triangular entries (since the diagonal entries stay constant under the transformation), and exchange the values A[i,j] and A[j,i] within one thread via a temporary variable.

I am having trouble assigning the correct next index pair A[i_next,j_next] to the thread currently working on A[i,j] (this corresponds to the last two lines in the code snippet). How do I have to modify these two lines so that each thread accesses only the correct matrix entries?

Example: N = 5, total_threads = 3.

This is a 5x5 matrix where each entry holds the thread_id of the thread that shall access it. The slashes mark the diagonal and lower triangular entries, which are never touched; the entries labeled 1 show which indices row_idx, col_idx should assume for the first thread.

Col0 Col1 Col2 Col3 Col4
/    1    2    3    1
/    /    2    3    1
/    /    /    2    3
/    /    /    /    1
/    /    /    /    /
__global__ void in_place_transpose(double *A, int N)
{
    int t_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int total_threads = blockDim.x * gridDim.x;
    int row_idx = t_idx / N;
    int col_idx = t_idx % N;

    // only access upper triangular entries
    if (row_idx < col_idx)
    {
        while (row_idx < (N - 1) && col_idx < N) // A[N-1,N] is the last entry we change
        {
            int to = row_idx * N + col_idx;
            int from = col_idx * N + row_idx;

            if (from != to) // diagonal is constant
            {
                // printf("from: %d, to: %d \n", from, to); // show operation pair
                // printf("th: %d, row: %d, col: %d \n", t_idx, row_idx, col_idx); // show indices
                double temp = A[to]; // A points to double, so the temporary must be double
                A[to] = A[from];
                A[from] = temp;
            }
            }
            // assign next matrix entry to thread (FAULTY)
            row_idx = (int)((row_idx + total_threads) % (N - row_idx - 1));
            col_idx = (int)(col_idx + (total_threads % (N - row_idx - 1)) % (N - row_idx - 1));
            
        }
    }
}

Upvotes: 1

Views: 304

Answers (1)

Robert Crovella

Reputation: 151799

Quoting the asker's follow-up comment: "I should have mentioned that I want this to work for N = 10^8, so creating a grid with enough threads isn't an option, unfortunately. Regarding performance, you are right of course; this idea is not very fast. However, I still want to make it work before moving on to a more optimized version."

  • I doubt that scale means you can't create enough threads, but we can set that aside.
  • I concur with the comments that are suggesting that you may want to think carefully about your algorithm starting point for efficient GPU usage, but you seem to have acknowledged that already.

With that preamble, I believe what you are asking for is how to convert a linear index into a 2-dimensional triangular index.

That problem can be solved without iteration, recursion, floating-point arithmetic, square roots, or other advanced functions: only addition, subtraction, multiplication, and division are needed.

Once we reduce the problem to one that can be driven by a linear index, we can apply a methodology like the grid-stride loop over that linear index, converting to a 2D triangular index "on the fly". This allows an arbitrary number of threads to work on and complete the problem.

Based on those ideas, here is one possible approach:

$ cat t2137.cu
#include <iostream>

// conversion of linear index to triangular matrix (row,col) index
__host__ __device__ void linear_to_triangular(const unsigned i, const unsigned n, unsigned &trow, unsigned &tcol)
{
    int c = i;
    int row = c/(n-1);
    int col = c - (row*(n-1)); // effectively modulo

    if (col < row) {
      col = (n-2)-col;
      row = (n-1)-row;
    }

    tcol = col+1;
    trow = row;
}

const size_t mdim = 547;
using mt = int;
const size_t tdim = 136;
const unsigned bdim = 64;

template <typename T>
__global__ void t(T *m, const size_t n){

  for (int idx = blockIdx.x*blockDim.x+threadIdx.x; idx < ((n*(n-1)) >> 1); idx += gridDim.x*blockDim.x){ // 1D grid-stride loop
    unsigned i,j;
    linear_to_triangular(idx, n, i, j);
    T e1 = m[i*n+j];
    T e2 = m[j*n+i];
    m[i*n+j] = e2;
    m[j*n+i] = e1;
    }
}

int main(){

  mt *d, *h;
  h = new mt[mdim*mdim];
  cudaMalloc(&d, mdim*mdim*sizeof(mt));
  for (int i = 0; i < mdim*mdim; i++) h[i] = i;
  cudaMemcpy(d, h, mdim*mdim*sizeof(mt), cudaMemcpyHostToDevice);
  t<<<(tdim+bdim-1)/bdim, bdim>>>(d, mdim);
  cudaMemcpy(h, d, mdim*mdim*sizeof(mt), cudaMemcpyDeviceToHost);
  for (int i = 0; i < mdim*mdim; i++) {
    int row = i/mdim;
    int col = i - row*mdim;
    if (h[col*mdim+row] != i) {std::cout << "Mismatch at: " << i << " was: " << h[col*mdim+row] << " should be: " << i << std::endl; return 0;}
    }
}

$ nvcc -o t2137 t2137.cu
$ compute-sanitizer ./t2137
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

I don't see much point in going deeply into performance. A naive transpose such as what you have shown can have one "coalesced" half, but the other half will then be (badly) "uncoalesced". In that respect, this algorithm isn't any different. The linear-to-triangular conversion will still tend to create indices such that adjacent threads often load (or store, depending on how you write it) adjacent elements, but the staggered pattern of the triangular data patch introduces discontinuities in the adjacency pattern. And if your loads are partly coalesced, your stores won't be.

Upvotes: 1
