OpenACC red-black Gauss-Seidel slower than CPU

Question

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, and that is much faster than both (for 512x512, on the order of 2 seconds compared to 25 for CPU and OpenACC).

Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for larger problems (like the Jacobi solver for the same sort of problem demonstrated here).

Here is the relevant code (the full working source is here):

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
  Real norm_L2 = 0.0;

  // update red cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
      reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_red = col * ((NUM / 2) + 2) + row;        // local (red) index
      int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1);  // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_red[ind_red];
      temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_red[ind_red] - temp_old;
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // update black cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
          reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_black = col * ((NUM / 2) + 2) + row;      // local (black) index
      int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1);    // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_black[ind_black];
      temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_black[ind_black] - temp_old;       
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // calculate residual
  norm_L2 = sqrt(norm_L2 / ((Real)size));

  if(iter % 100 == 0) printf("%5d, %0.6f
", iter, norm_L2);

  // if tolerance has been reached, end SOR iterations
  if (norm_L2 < tol) {
    break;
  }
}

Kyle Niemeyer · Accepted Answer

Alright, I found a semi-solution that reduces the time somewhat significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before I start my timer, in order to activate and set the GPU, the time for the 512x512 problem drops to 9.8 seconds, and down to 42 for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.

With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap getting closer to just a bit slower (~1.2) as the problem size gets bigger and bigger.

OpenACC red-black Gauss-Seidel slower than CPU

Answers (2)

Related Questions