Reputation: 3
A lot of CUDA samples show that you have to move data from global memory into shared memory before using it. For example, consider a function that sums the values in 5x5 squares. The profiler shows that the version without shared memory runs about 20% faster. Do I have to put my data into shared memory, or will Maxwell put the data into the L1 cache automatically?
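Roughly the kind of kernel I'm comparing (a simplified sketch, not my actual code; the names and border handling are made up):
__global__ void sum5x5(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if ((x < width) && (y < height)) {
        float s = 0.0f;
        // plain version: every thread reads its 5x5 window straight from global memory
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++) {
                int yy = min(y + i, height - 1);   // clamp at the border
                int xx = min(x + j, width - 1);
                s += in[yy*width + xx];
            }
        out[y*width + x] = s;
    }
}
The shared memory variant stages the block's tile (plus a 4-element border) in shared memory first and then does the same summation from there.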
Upvotes: 0
Views: 89
Reputation: 151849
Shared memory is still a useful optimization for many codes, even on Maxwell.
If you have a 2D stencil code (which appears to be what you are describing), I would certainly expect the version that runs out of shared memory to perform faster, assuming the shared memory adaptation/usage is done correctly.
Here's a fully worked example of a 2D stencil code, in both shared-memory and non-shared-memory versions, running on a GTX 960. The shared memory version's compute time is about 30% lower (roughly 1.5 s vs. 2.2 s in the test output below):
non-shared memory version:
$ cat example3a_imp.cu
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// these are just for timing measurements
#include <time.h>
// Code that reads values from a 2D grid and for each node in the grid finds the minimum
// value among all values stored in cells sharing that node, and stores the minimum
// value in that node.
//define the window size (square window) and the data set size
#define WSIZE 16
#define DATAHSIZE 8000
#define DATAWSIZE 16000
#define CHECK_VAL 1
#define MIN(X,Y) ((X<Y)?X:Y)
#define BLKWSIZE 32
#define BLKHSIZE 32
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
typedef int oArray[DATAHSIZE];
typedef int iArray[DATAHSIZE+WSIZE];
__global__ void cmp_win(oArray *output, const iArray *input)
{
int tempout, i, j;
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int idy = blockIdx.y*blockDim.y + threadIdx.y;
if ((idx < DATAHSIZE) && (idy < DATAWSIZE)){
tempout = output[idy][idx];
#pragma unroll
for (i=0; i<WSIZE; i++)
#pragma unroll
for (j=0; j<WSIZE; j++)
if (input[idy + i][idx + j] < tempout)
tempout = input[idy + i][idx + j];
output[idy][idx] = tempout;
}
}
int main(int argc, char *argv[])
{
int i, j;
const dim3 blockSize(BLKHSIZE, BLKWSIZE, 1);
const dim3 gridSize(((DATAHSIZE+BLKHSIZE-1)/BLKHSIZE), ((DATAWSIZE+BLKWSIZE-1)/BLKWSIZE), 1);
// these are just for timing
clock_t t0, t1, t2;
double t1sum=0.0;
double t2sum=0.0;
// overall data set sizes
const int nr = DATAHSIZE;
const int nc = DATAWSIZE;
// window dimensions
const int wr = WSIZE;
const int wc = WSIZE;
// pointers for data set storage via malloc
iArray *h_in, *d_in;
oArray *h_out, *d_out;
// start timing
t0 = clock();
// allocate storage for data set
if ((h_in = (iArray *)malloc(((nr+wr)*(nc+wc))*sizeof(int))) == 0) {printf("malloc Fail \n"); exit(1);}
if ((h_out = (oArray *)malloc((nr*nc)*sizeof(int))) == 0) {printf("malloc Fail \n"); exit(1); }
// synthesize data
printf("Begin init\n");
memset(h_in, 0x7F, (nr+wr)*(nc+wc)*sizeof(int));
memset(h_out, 0x7F, (nr*nc)*sizeof(int));
for (i=0; i<nc+wc; i+=wc)
for (j=0; j< nr+wr; j+=wr)
h_in[i][j] = CHECK_VAL;
t1 = clock();
t1sum = ((double)(t1-t0))/CLOCKS_PER_SEC;
printf("Init took %f seconds. Begin compute\n", t1sum);
// allocate GPU device buffers
cudaMalloc((void **) &d_in, (((nr+wr)*(nc+wc))*sizeof(int)));
cudaCheckErrors("Failed to allocate device buffer");
cudaMalloc((void **) &d_out, ((nr*nc)*sizeof(int)));
cudaCheckErrors("Failed to allocate device buffer2");
// copy data to GPU
cudaMemcpy(d_out, h_out, ((nr*nc)*sizeof(int)), cudaMemcpyHostToDevice);
cudaCheckErrors("CUDA memcpy failure");
cudaMemcpy(d_in, h_in, (((nr+wr)*(nc+wc))*sizeof(int)), cudaMemcpyHostToDevice);
cudaCheckErrors("CUDA memcpy2 failure");
cmp_win<<<gridSize,blockSize>>>(d_out, d_in);
cudaCheckErrors("Kernel launch failure");
// copy output data back to host
cudaMemcpy(h_out, d_out, ((nr*nc)*sizeof(int)), cudaMemcpyDeviceToHost);
cudaCheckErrors("CUDA memcpy3 failure");
t2 = clock();
t2sum = ((double)(t2-t1))/CLOCKS_PER_SEC;
printf ("Done. Compute took %f seconds\n", t2sum);
for (i=0; i < nc; i++)
for (j=0; j < nr; j++)
if (h_out[i][j] != CHECK_VAL) {printf("mismatch at %d,%d, was: %d should be: %d\n", i,j,h_out[i][j], CHECK_VAL); return 1;}
printf("Results pass\n");
return 0;
}
shared memory version:
$ cat example3b_imp.cu
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// these are just for timing measurements
#include <time.h>
// Code that reads values from a 2D grid and for each node in the grid finds the minimum
// value among all values stored in cells sharing that node, and stores the minimum
// value in that node.
//define the window size (square window) and the data set size
#define WSIZE 16
#define DATAHSIZE 8000
#define DATAWSIZE 16000
#define CHECK_VAL 1
#define MIN(X,Y) ((X<Y)?X:Y)
#define BLKWSIZE 32
#define BLKHSIZE 32
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
typedef int oArray[DATAHSIZE];
typedef int iArray[DATAHSIZE+WSIZE];
__global__ void cmp_win(oArray *output, const iArray *input)
{
__shared__ int smem[(BLKHSIZE + (WSIZE-1))][(BLKWSIZE + (WSIZE-1))];
int tempout, i, j;
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int idy = blockIdx.y*blockDim.y + threadIdx.y;
if ((idx < DATAHSIZE) && (idy < DATAWSIZE)){
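// each thread loads its own element of the input tile into shared memory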
smem[threadIdx.y][threadIdx.x]=input[idy][idx];
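// threads in the last (WSIZE-1) rows/columns of the block also load the halo elements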
if (threadIdx.y > (BLKWSIZE - WSIZE))
smem[threadIdx.y + (WSIZE-1)][threadIdx.x] = input[idy+(WSIZE-1)][idx];
if (threadIdx.x > (BLKHSIZE - WSIZE))
smem[threadIdx.y][threadIdx.x + (WSIZE-1)] = input[idy][idx+(WSIZE-1)];
if ((threadIdx.x > (BLKHSIZE - WSIZE)) && (threadIdx.y > (BLKWSIZE - WSIZE)))
smem[threadIdx.y + (WSIZE-1)][threadIdx.x + (WSIZE-1)] = input[idy+(WSIZE-1)][idx+(WSIZE-1)];
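// wait until the entire tile, including the halo, is populated before any thread reads it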
__syncthreads();
tempout = output[idy][idx];
for (i=0; i<WSIZE; i++)
for (j=0; j<WSIZE; j++)
if (smem[threadIdx.y + i][threadIdx.x + j] < tempout)
tempout = smem[threadIdx.y + i][threadIdx.x + j];
output[idy][idx] = tempout;
}
}
int main(int argc, char *argv[])
{
int i, j;
const dim3 blockSize(BLKHSIZE, BLKWSIZE, 1);
const dim3 gridSize(((DATAHSIZE+BLKHSIZE-1)/BLKHSIZE), ((DATAWSIZE+BLKWSIZE-1)/BLKWSIZE), 1);
// these are just for timing
clock_t t0, t1, t2;
double t1sum=0.0;
double t2sum=0.0;
// overall data set sizes
const int nr = DATAHSIZE;
const int nc = DATAWSIZE;
// window dimensions
const int wr = WSIZE;
const int wc = WSIZE;
// pointers for data set storage via malloc
iArray *h_in, *d_in;
oArray *h_out, *d_out;
// start timing
t0 = clock();
// allocate storage for data set
if ((h_in = (iArray *)malloc(((nr+wr)*(nc+wc))*sizeof(int))) == 0) {printf("malloc Fail \n"); exit(1);}
if ((h_out = (oArray *)malloc((nr*nc)*sizeof(int))) == 0) {printf("malloc Fail \n"); exit(1); }
// synthesize data
printf("Begin init\n");
memset(h_in, 0x7F, (nr+wr)*(nc+wc)*sizeof(int));
memset(h_out, 0x7F, (nr*nc)*sizeof(int));
for (i=0; i<nc+wc; i+=wc)
for (j=0; j< nr+wr; j+=wr)
h_in[i][j] = CHECK_VAL;
t1 = clock();
t1sum = ((double)(t1-t0))/CLOCKS_PER_SEC;
printf("Init took %f seconds. Begin compute\n", t1sum);
// allocate GPU device buffers
cudaMalloc((void **) &d_in, (((nr+wr)*(nc+wc))*sizeof(int)));
cudaCheckErrors("Failed to allocate device buffer");
cudaMalloc((void **) &d_out, ((nr*nc)*sizeof(int)));
cudaCheckErrors("Failed to allocate device buffer2");
// copy data to GPU
cudaMemcpy(d_out, h_out, ((nr*nc)*sizeof(int)), cudaMemcpyHostToDevice);
cudaCheckErrors("CUDA memcpy failure");
cudaMemcpy(d_in, h_in, (((nr+wr)*(nc+wc))*sizeof(int)), cudaMemcpyHostToDevice);
cudaCheckErrors("CUDA memcpy2 failure");
cmp_win<<<gridSize,blockSize>>>(d_out, d_in);
cudaCheckErrors("Kernel launch failure");
// copy output data back to host
cudaMemcpy(h_out, d_out, ((nr*nc)*sizeof(int)), cudaMemcpyDeviceToHost);
cudaCheckErrors("CUDA memcpy3 failure");
t2 = clock();
t2sum = ((double)(t2-t1))/CLOCKS_PER_SEC;
printf ("Done. Compute took %f seconds\n", t2sum);
for (i=0; i < nc; i++)
for (j=0; j < nr; j++)
if (h_out[i][j] != CHECK_VAL) {printf("mismatch at %d,%d, was: %d should be: %d\n", i,j,h_out[i][j], CHECK_VAL); return 1;}
printf("Results pass\n");
return 0;
}
test:
$ nvcc -O3 -arch=sm_52 example3a_imp.cu -o ex3
$ nvcc -O3 -arch=sm_52 example3b_imp.cu -o ex3_shared
$ ./ex3
Begin init
Init took 0.986819 seconds. Begin compute
Done. Compute took 2.162276 seconds
Results pass
$ ./ex3_shared
Begin init
Init took 0.987281 seconds. Begin compute
Done. Compute took 1.522475 seconds
Results pass
$
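To compare the kernels themselves rather than the whole-program host timing, you could also run each executable under the profiler, for example:
$ nvprof ./ex3
$ nvprof ./ex3_shared
and compare the reported cmp_win kernel times.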
Upvotes: 2