Transposing the rectangular part of a matrix in CUDA

Question

I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the largest portion whose side is a multiple of the tile size is transposed tile by tile, and the rectangular part that remains is transposed separately. Example: for a 67 x 67 matrix with a tile size of 32, the first part is the 64 x 64 block, and the second part is the remaining 3 x 67 strip.
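To make the split concrete, here is a minimal host-side sketch of the arithmetic. The helper `splitByTile` and the struct `TileSplit` are hypothetical names, not from the code in this post; it assumes the leftover width is simply the matrix size modulo the tile size, which matches the 67 / 32 example.

```cpp
#include <cassert>

// Hypothetical helper (not part of the posted code): describes how an
// n x n matrix splits into a tiled square part and a leftover strip.
struct TileSplit {
    int nEven;  // side of the square part covered by whole tiles
    int edge;   // width of the leftover strip
};

inline TileSplit splitByTile(int n, int tileDim) {
    assert(n >= 0 && tileDim > 0);
    TileSplit s;
    s.edge = n % tileDim;   // columns/rows that do not fill a whole tile
    s.nEven = n - s.edge;   // largest multiple of tileDim not exceeding n
    return s;
}
```

For n = 67 and a tile size of 32 this gives nEven = 64 and edge = 3; for n = 2024 it gives nEven = 2016 and edge = 8.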

My problem is with the rectangular part. The code below shows the main program with the defined values:

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;

const int Nx = 2024; //size of the matrix
const int Ny = 2024;

int main(int argc, char **argv)
{
    const int nx = Nx;
    const int ny = Ny;                          // size of the arrays
    const int mem_size = nx * ny * sizeof(int); // size of the original array in bytes

    int *h_idata = (int*)malloc(mem_size);      // original host array

    int *d_idata;                               // device array
    checkCuda(cudaMalloc(&d_idata, mem_size));

    dim3 dimGridX(nx / TILE_DIM, 1, 1);         // grid dimension used
    dim3 dimBlockX(TILE_DIM, 1, 1);             // number of threads per block

    // timing events (declarations were not shown in the original excerpt)
    cudaEvent_t startEvent, stopEvent;
    float ms;
    checkCuda(cudaEventCreate(&startEvent));
    checkCuda(cudaEventCreate(&stopEvent));

    // the kernel function for only the rectangle,
    // launched between the two event records so that it is actually timed
    checkCuda(cudaEventRecord(startEvent, 0));
    EdgeTransposeX<<<dimGridX, dimBlockX>>>(d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost));
}

As for the kernel code, I was advised not to use shared memory, so this is how I've done it:

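__global__ void EdgeTransposeX(int *idata)
{
    // per-thread local copies of the edge columns (tile_V) and edge rows (tile_C)
    int tile_C[Edge][Nx];
    int tile_V[Nx][Edge];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;

    // only the thread sitting on the last column of the tiled part does the work
    if (x == (nEven - 1))
    {
        // read the edge columns and the edge rows
        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                tile_V[j][i - 1] = idata[j*Nx + (x + i)];
                tile_C[i - 1][j] = idata[(x + i)*Nx + j];
            }

        // note: only threads that satisfy the condition above reach this barrier
        __syncthreads();

        // write them back swapped
        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                idata[j*Nx + (x + i)] = tile_C[i - 1][j];
                idata[(x + i)*Nx + j] = tile_V[j][i - 1];
            }
    }
}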

The code works fine until the matrix size reaches 1025; after that it stops working. Any idea why? Am I missing something here?
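For reference, one quantity that may be worth looking at is how much per-thread storage the two local arrays `tile_C` and `tile_V` request as the matrix grows. The sketch below is only that arithmetic, not a diagnosis; `localArrayBytes` is a hypothetical helper, and it assumes `Edge` equals `Nx % TILE_DIM`, which the post does not show explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Bytes requested per thread by the two local arrays in EdgeTransposeX:
//   int tile_C[Edge][Nx];  and  int tile_V[Nx][Edge];
// Assumption (not stated in the post): Edge = Nx % tileDim.
inline std::size_t localArrayBytes(int nx, int tileDim) {
    assert(nx >= 0 && tileDim > 0);
    const std::size_t edge = static_cast<std::size_t>(nx % tileDim);
    // two arrays, each holding edge * nx ints
    return 2 * edge * static_cast<std::size_t>(nx) * sizeof(int);
}
```

With a 4-byte int, Nx = 1025 gives 8200 bytes per thread and Nx = 2024 gives 129536 bytes per thread. Every thread in the launch declares these arrays even though only one thread uses them, so checking the value returned by cudaGetLastError() right after the kernel launch might show whether the kernel started at all.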

