Reputation: 301
I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.
I also wrote a CUDA version, and that is much faster than both (for 512x512, on the order of 2 seconds compared to 25 for CPU and OpenACC).
Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for larger problems (like the Jacobi solver for the same sort of problem demonstrated here).
Here is the relevant code (the full working source is here):
#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
    Real norm_L2 = 0.0;

    // update red cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
        reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {
            int ind_red = col * ((NUM / 2) + 2) + row; // local (red) index
            int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                              + aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                              + aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                              + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_red[ind_red];
            temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_red[ind_red] - temp_old;
            norm_L2 += (res * res);
        } // end for row
    } // end for col

    // update black cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
        reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {
            int ind_black = col * ((NUM / 2) + 2) + row; // local (black) index
            int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                              + aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                              + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                              + aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_black[ind_black];
            temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_black[ind_black] - temp_old;
            norm_L2 += (res * res);
        } // end for row
    } // end for col

    // calculate residual
    norm_L2 = sqrt(norm_L2 / ((Real)size));
    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

    // if tolerance has been reached, end SOR iterations
    if (norm_L2 < tol) {
        break;
    }
}
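For anyone trying to follow the indexing, here is a small standalone check of the red/black-to-global mapping used in the loops above. It is only a sketch: the helper names (red_global_index, black_global_index, coloring_covers_grid) are made up for illustration, and it assumes NUM is even and that the global arrays hold NUM * NUM entries, which the loop bounds suggest but the snippet does not state.

```c
#include <stdlib.h>

/* Hypothetical helper: global index of the red cell at (row, col),
   using the same formula as the red loop in the solver. */
static int red_global_index(int row, int col, int num) {
    return 2 * row - (col % 2) - 1 + num * (col - 1);
}

/* Hypothetical helper: global index of the black cell at (row, col),
   using the same formula as the black loop in the solver. */
static int black_global_index(int row, int col, int num) {
    return 2 * row - ((col + 1) % 2) - 1 + num * (col - 1);
}

/* Returns 1 if the red and black loops together touch every global index
   in [0, num * num) exactly once, and 0 otherwise (including when the
   allocation fails). Assumption: the global arrays have num * num entries. */
static int coloring_covers_grid(int num) {
    int size = num * num;
    int *hits = calloc((size_t)size, sizeof *hits);
    if (!hits) return 0;

    int ok = 1;
    for (int col = 1; col < num + 1; ++col) {
        for (int row = 1; row < (num / 2) + 1; ++row) {
            int r = red_global_index(row, col, num);
            int k = black_global_index(row, col, num);
            if (r < 0 || r >= size || k < 0 || k >= size) {
                ok = 0;       /* index outside the global arrays */
                continue;
            }
            ++hits[r];
            ++hits[k];
        }
    }
    for (int i = 0; i < size; ++i) {
        if (hits[i] != 1) ok = 0;  /* cell skipped or visited twice */
    }
    free(hits);
    return ok;
}
```

Calling coloring_covers_grid(512) checks the 512x512 case. With an odd num the last cell of each column is never visited, so the function returns 0.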
Upvotes: 5
Views: 1513
Reputation: 153
I downloaded your full code, compiled it, and ran it. It never stopped running, and for the statement
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);
the result was:
100, nan
200, nan
....
I changed all variables of type Real to type float, and the result was:
100, 0.000654
200, 0.000370
..., ....
..., ....
8800, 0.000002
8900, 0.000002
9000, 0.000001
9100, 0.000001
9200, 0.000001
9300, 0.000001
9400, 0.000001
9500, 0.000001
9600, 0.000001
9700, 0.000001
CPU
Iterations: 9796
Total time: 5.594017 s
With NUM = 1024 the result was:
Iterations: 27271
Total time: 25.949905 s
Upvotes: 0
Reputation: 301
Alright, I found a semi-solution that significantly reduces the run time for smaller problems.
If I insert the lines:
acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);
before I start my timer, in order to activate and set the GPU, the time for the 512x512 problem drops to 9.8 seconds, and to 42 seconds for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.
With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap narrowing to only slightly slower (~1.2x) as the problem size grows.
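To make the placement concrete, here is a rough sketch of doing the device startup before the timer begins. The helper names (warm_up_device, wall_seconds, time_solver, demo_solve) are hypothetical and not from my actual source. The _OPENACC guard is only there so the file still builds without an OpenACC compiler, and timespec_get assumes a C11 library.

```c
#include <time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: do the one-time GPU startup outside the timed region.
   Returns 1 if the OpenACC runtime was initialized, 0 if built without OpenACC. */
static int warm_up_device(void) {
#ifdef _OPENACC
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Placeholder standing in for the real solver loop. */
static void demo_solve(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 1000; ++i) sum += (double)i;
}

/* Initialize the device first, then time only the solver call. */
static double time_solver(void (*solve)(void)) {
    warm_up_device();             /* startup cost is paid here, untimed */
    double t0 = wall_seconds();
    solve();
    return wall_seconds() - t0;   /* elapsed seconds for the solver alone */
}
```

Something like time_solver(demo_solve) returns the elapsed seconds. In the real program the solver loop takes the place of demo_solve, so the one-time device startup no longer counts toward the measured time.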
Upvotes: 3