Reputation: 11
I am working on an in-place matrix-transpose implementation, with the following idea: let the indices only take values that correspond to upper-triangular entries (the diagonal entries stay constant during the transformation) and exchange the values A[i,j] and A[j,i] within one thread via a temporary variable.
I am having trouble assigning the correct next index pair A[i_next,j_next] to the thread currently working on A[i,j] (this corresponds to the last two lines in the code snippet). How do I have to modify these two lines so that each thread accesses only the correct matrix entries?
Example:
For N = 5 and total_threads = 3:
This is a 5x5 matrix, where each entry holds the thread_id of the thread it shall be accessed by (this shows which indices I want row_idx and col_idx to assume for each specific thread).
Col0 | Col1 | Col2 | Col3 | Col4 |
---|---|---|---|---|
/ | 1 | 2 | 3 | 1 |
/ | / | 2 | 3 | 1 |
/ | / | / | 2 | 3 |
/ | / | / | / | 1 |
/ | / | / | / | / |
```cuda
__global__ void in_place_transpose(double *A, int N)
{
    int t_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int total_threads = blockDim.x * gridDim.x;
    int row_idx = t_idx / N;
    int col_idx = t_idx % N;
    // only access upper triangular entries
    if (row_idx < col_idx)
    {
        while (row_idx < (N - 1) && col_idx < N) // A[N-2,N-1] is the last entry we change
        {
            int to = row_idx * N + col_idx;
            int from = col_idx * N + row_idx;
            if (from != to) // diagonal is constant
            {
                // printf("from: %d, to: %d \n", from, to); // show operation pair
                // printf("th: %d, row: %d, col: %d \n", t_idx, row_idx, col_idx); // show indices
                double temp = A[to];
                A[to] = A[from];
                A[from] = temp;
            }
            // assign next matrix entry to thread (FAULTY)
            row_idx = (int)((row_idx + total_threads) % (N - row_idx - 1));
            col_idx = (int)(col_idx + (total_threads % (N - row_idx - 1)) % (N - row_idx - 1));
        }
    }
}
```
Upvotes: 1
Views: 304
Reputation: 151799
> I should have mentioned that I want this to work for N=10^8, so creating a grid with enough threads isn't an option, unfortunately. Regarding performance, you are right of course; this idea is not very fast. However, I still want to make it work before moving on to a more optimized version.
With that preamble, I believe what you are asking for is how to convert a linear index into a 2-dimensional triangular index.
That problem can be solved without iteration, recursion, floating-point arithmetic, square roots, or other advanced functions: just addition, subtraction, multiplication, and division.
Once we reduce the problem to one that can be solved using a linear index, we can apply methodologies like the grid-stride loop on that linear index, with conversion to the 2D triangular index "on the fly", to allow an arbitrary number of threads to work on and complete the problem.
Based on those ideas, here is one possible approach:
```cuda
$ cat t2137.cu
#include <iostream>

// conversion of linear index to triangular matrix (row,col) index
__host__ __device__ void linear_to_triangular(const unsigned i, const unsigned n, unsigned &trow, unsigned &tcol)
{
    int c = i;
    int row = c/(n-1);
    int col = c - (row*(n-1)); // effectively modulo
    if (col < row) {           // reflect into the unused half
        col = (n-2)-col;
        row = (n-1)-row;
    }
    tcol = col+1;              // shift off the diagonal
    trow = row;
}

const size_t mdim = 547;
using mt = int;
const size_t tdim = 136;
const unsigned bdim = 64;

template <typename T>
__global__ void t(T *m, const size_t n){
    for (int idx = blockIdx.x*blockDim.x+threadIdx.x; idx < ((n*(n-1)) >> 1); idx += gridDim.x*blockDim.x){ // 1D grid-stride loop
        unsigned i,j;
        linear_to_triangular(idx, n, i, j);
        T e1 = m[i*n+j];
        T e2 = m[j*n+i];
        m[i*n+j] = e2;
        m[j*n+i] = e1;
    }
}

int main(){
    mt *d, *h;
    h = new mt[mdim*mdim];
    cudaMalloc(&d, mdim*mdim*sizeof(mt));
    for (int i = 0; i < mdim*mdim; i++) h[i] = i;
    cudaMemcpy(d, h, mdim*mdim*sizeof(mt), cudaMemcpyHostToDevice);
    t<<<(tdim+bdim-1)/bdim, bdim>>>(d, mdim);
    cudaMemcpy(h, d, mdim*mdim*sizeof(mt), cudaMemcpyDeviceToHost);
    // verify: element (row, col) must now hold the original (col, row) value
    for (int i = 0; i < mdim*mdim; i++) {
        int row = i/mdim;
        int col = i - row*mdim;
        if (h[col*mdim+row] != i) {std::cout << "Mismatch at: " << i << " was: " << h[col*mdim+row] << " should be: " << i << std::endl; return 0;}
    }
}
$ nvcc -o t2137 t2137.cu
$ compute-sanitizer ./t2137
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$
```
I don't see much point in going deeply into performance. A naive transpose such as what you have shown can have one "coalesced" half but will then have the other half (badly) "uncoalesced". As far as that goes, this algorithm isn't any different. The linear-to-triangular conversion will still tend to create indices such that adjacent threads will often be loading (or storing, depending on how you write it) adjacent elements, but of course the staggered pattern of the triangular data patch introduces discontinuities in the adjacency pattern. And if your loads are partly coalesced, your stores won't be.
Upvotes: 1