Reputation: 398
When using an OpenACC "#pragma acc routine worker" routine that contains multiple loops of vector (and worker) level parallelism, how do vector_length and num_workers work?
I played around with some code (see below) and stumbled upon a few things:
1. The vector_length(#) clause on the outer parallel region seems to behave weirdly when comparing run times. When I increase the vector length to huge numbers, say e.g. 4096, the run time actually gets smaller. In my understanding, a huge amount of threads should lie dormant when there are only as many as 10 iterations in the vector loop. Am I doing something wrong here?
2. The num_workers clause on the parallel region in foo(): if it is 16 or smaller, the output is "correct". If it is 32 or even much larger, the loops inside the worker routine somehow get executed twice. What am I missing here?
Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.
Here is the example code:
#include <iostream>
#include <chrono>

class A{
public:
    int out;
    int* some_array;

    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    }

    ~A(){
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }

    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;

        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }

        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }

        #pragma acc atomic update
        out += private_out;
    }

    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;

            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }

            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;

    A a;

    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();

    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "ms" << std::endl;

    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844ms
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145ms
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480ms
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502ms
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853ms

    return 0;
}
Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4
I simply used some printfs in the worker to look into the intermediate results. I placed them at the implicit barriers between the loops. I could see that the value of private_out went from its initial value of 10 to -10 instead of 0 between the loops, and ended at 1990 instead of 1000. This just looks to me like both loops are being executed twice.
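Roughly, the instrumented routine looked like this (a sketch reconstructed from memory, so the exact printf wording is not literal; it also needs #include <cstdio>):

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;

    #pragma acc loop vector reduction(+: private_out)
    for(int j=0; j < 10; ++j){
        private_out -= some_array[j];
    }
    //at the implicit barrier after the first loop: expected 0, but printed -10
    printf("after vector loop: %d\n", private_out);

    #pragma acc loop reduction(+: private_out) worker
    for(int j=0; j < 100; ++j){
        #pragma acc loop reduction(+: private_out) vector
        for(int k=0; k < 10; ++k){
            private_out += some_array[k+j*10+i*10*100];
        }
    }
    //at the barrier after the second loop: expected 1000, but printed 1990
    printf("after worker loop: %d\n", private_out);

    #pragma acc atomic update
    out += private_out;
}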
To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, and even to 32 (which pushes the thread count over the limit of 1024). It gives the error message
ptxas error : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41
However, after simply inserting the printfs as described above, it suddenly compiles fine but runs into a runtime error: "call to cuLaunchKernel returned error 1: Invalid value".
But the strangest part is that it compiles and runs fine for 64/64, yet returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1. Note that the output is almost exactly the same for all configurations that compile and run, except for the block: [1x#-######] part.
Accelerator Kernel Timing data
/path/to/src/main.cpp
_ZN1AC1Ev NVIDIA devicenum=0
time(us): 665
265: data region reached 1 time
265: data copyin transfers: 3
device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
_ZN1AD1Ev NVIDIA devicenum=0
time(us): 8
269: data region reached 1 time
269: data copyin transfers: 1
device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
_ZN1A3fooEv NVIDIA devicenum=0
time(us): 1,243
296: data region reached 2 times
298: compute region reached 2 times
298: kernel launched 2 times
grid: [1-1000] block: [1-32x1-24]
device time(us): total=1,230 max=1,225 min=5 avg=615
elapsed time(us): total=1,556 max=1,242 min=314 avg=778
304: update directive reached 1 time
304: data copyout transfers: 1
device time(us): total=13 max=13 min=13 avg=13
Upvotes: 1
Views: 856
Reputation: 5646
The exact mapping of workers and vectors will depend on the target device and implementation. Specifically, when using NVHPC targeting NVIDIA GPUs, a "gang" maps to a CUDA block, "worker" maps to the y-dimension of a thread block, and "vector" to the x-dimension. The value used in "num_workers" or "vector_length" may be reduced given the constraints of the target. CUDA blocks can contain at most 1024 threads, so the "4096" value will be reduced to what is allowed by the hardware. Secondly, in order to support vector reductions in device routines, the maximum vector_length is 32. In other words, your "4096" value is actually "32" due to these constraints.
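Applied to your 64/64 run, the schedule you requested versus what actually gets launched looks roughly like this (the launched sizes are taken from the NV_ACC_TIME output you posted; the annotation is mine):

//requested schedule
#pragma acc parallel loop gang num_workers(64) vector_length(4096)
for(int i=0; i < 1000; ++i){
    some_worker(i);
}
//what the runtime actually launches on this target:
//  gang   -> CUDA block        : grid  [1-1000]
//  vector -> block x-dimension : capped at 32 (reduction support in routines)
//  worker -> block y-dimension : reduced to 24 here (presumably per-block
//                                resource limits)
//  i.e. block: [1-32x1-24], as reported by NV_ACC_TIME above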
Note: to see the max thread block size on your device, run the "nvaccelinfo" utility and look for the "Maximum Threads per Block" and "Maximum Block Dimensions" fields. Also, setting the environment variable "NV_ACC_TIME=1" will have the runtime produce some basic profiling information, including the actual number of blocks and thread block size used during the run.
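If you'd rather query those limits from code than via nvaccelinfo, the CUDA runtime API reports the same numbers (this is plain CUDA, not OpenACC, and just one way to do it):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   //device 0
    printf("Maximum Threads per Block: %d\n", prop.maxThreadsPerBlock);
    printf("Maximum Block Dimensions : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}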
In my understanding, a huge amount of threads should lie dormant when there are only as many as 10 iterations in the vector loop.
CUDA threads are grouped into "warps" of 32 threads, where all threads of a warp execute the same instructions concurrently (aka SIMT, single instruction multiple threads). Hence, even though only 10 threads are doing useful work, the remaining 22 are not dormant. Plus, they still take resources such as registers, so adding too many threads for loops with low trip counts may actually hurt performance.
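For your first vector loop that means (the loop is unchanged, only the annotation is mine):

//with a 32-thread warp in the x-dimension but a trip count of 10:
#pragma acc loop vector reduction(+: private_out)
for(int j=0; j < 10; ++j){        //threads 0-9 do useful work; the other 22
    private_out -= some_array[j]; //still run with the warp and hold registers
}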
In this case, setting the vector length to 1 is most likely the best case since the warp can now be comprised of the y-dimension threads. Setting it to 2 will cause a full 32-thread warp in the x-dimension, but with only 2 threads doing useful work.
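Performance-wise, that suggests a schedule along these lines, which matches the fastest timing you reported (a sketch; whether it also gives correct results is the separate issue discussed below):

#pragma acc parallel loop gang num_workers(64) vector_length(1)
for(int i=0; i < 1000; ++i){
    some_worker(i);
}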
As to why some combinations give incorrect results, I didn't investigate. Routine worker, especially with reductions, is rarely used, so it's possible we have some type of code generation issue, like an off-by-one error in the reduction, at these irregular schedule sizes. I'll look into this later and determine if I need to file an issue report.
For #2, how are you determining that it's getting run twice? Is this just based on the runtime?
Upvotes: 2