Reputation: 398
When using an OpenACC "#pragma acc routine worker" routine that contains multiple loops of vector (and worker) level parallelism, how do vector_length and num_workers work?
I played around with some code (see below) and stumbled upon a few things:
1. The vector_length(#) clause on the outer parallel region seems to behave weirdly when comparing run times. When I increase the vector length to huge numbers, say e.g. 4096, the run time actually gets smaller. In my understanding, a huge amount of threads should lie dormant when there are only as many as 10 iterations in the vector loop. Am I doing something wrong here?
2. The num_workers clause on the parallel region in foo(): if it is 16 or smaller, the output is "correct". If it is 32 or even much larger, the loops inside the worker routine somehow get executed twice. What am I missing here?
Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.
Here is the example code:
#include <iostream>
#include <chrono>

class A{
public:
    int out;
    int* some_array;

    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    }

    ~A(){
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }

    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;

        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }

        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }

        #pragma acc atomic update
        out += private_out;
    }

    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;

            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }

            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;

    A a;

    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();

    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "ms" << std::endl;

    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844ms
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145ms
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480ms
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502ms
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853ms

    return 0;
}
Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4
I simply used some printfs in the worker to look into the intermediate results. I placed them at the implicit barriers between the loops. I could see that the value of private_out went from its initial value of 10 to -10 instead of 0 between the loops, and ended at 1990 instead of 1000. This just looks to me like both loops are being executed twice.
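Roughly, the instrumented routine looked like this (a sketch reconstructed from memory, so the exact printf wording is not literal; it also needs #include <cstdio>):

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;

    #pragma acc loop vector reduction(+: private_out)
    for(int j=0; j < 10; ++j){
        private_out -= some_array[j];
    }
    //at the implicit barrier after the first loop: expected 0, but printed -10
    printf("after vector loop: %d\n", private_out);

    #pragma acc loop reduction(+: private_out) worker
    for(int j=0; j < 100; ++j){
        #pragma acc loop reduction(+: private_out) vector
        for(int k=0; k < 10; ++k){
            private_out += some_array[k+j*10+i*10*100];
        }
    }
    //at the barrier after the second loop: expected 1000, but printed 1990
    printf("after worker loop: %d\n", private_out);

    #pragma acc atomic update
    out += private_out;
}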
To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, and even to 32 (which pushes the thread count over the limit of 1024). It gives the error message
ptxas error : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41
However, after simply inserting the printfs as described above, it suddenly compiles fine but runs into a runtime error: "call to cuLaunchKernel returned error 1: Invalid value".
But the strangest part is that it compiles and runs fine for 64/64, yet returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1. Note that the output is almost exactly the same for all configurations that compile and run, except for the block: [1x#-######] part.
Accelerator Kernel Timing data
/path/to/src/main.cpp
_ZN1AC1Ev NVIDIA devicenum=0
time(us): 665
265: data region reached 1 time
265: data copyin transfers: 3
device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
_ZN1AD1Ev NVIDIA devicenum=0
time(us): 8
269: data region reached 1 time
269: data copyin transfers: 1
device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
_ZN1A3fooEv NVIDIA devicenum=0
time(us): 1,243
296: data region reached 2 times
298: compute region reached 2 times
298: kernel launched 2 times
grid: [1-1000] block: [1-32x1-24]
device time(us): total=1,230 max=1,225 min=5 avg=615
elapsed time(us): total=1,556 max=1,242 min=314 avg=778
304: update directive reached 1 time
304: data copyout transfers: 1
device time(us): total=13 max=13 min=13 avg=13
Upvotes: 1
Views: 856
Reputation: 5646
The exact mapping of workers and vectors will depend on the target device and implementation. Specifically, when using NVHPC targeting NVIDIA GPUs, a "gang" maps to a CUDA block, "worker" maps to the y-dimension of a thread block, and "vector" to the x-dimension. The value used in "num_workers" or "vector_length" may be reduced given the constraints of the target. CUDA blocks can contain at most 1024 threads, so the "4096" value will be reduced to what is allowed by the hardware. Secondly, in order to support vector reductions in device routines, the maximum vector_length is 32. In other words, your "4096" value is actually "32" due to these constraints.
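Applied to your 64/64 run, the schedule you requested versus what actually gets launched looks roughly like this (the launched sizes are taken from the NV_ACC_TIME output you posted; the annotation is mine):

//requested schedule
#pragma acc parallel loop gang num_workers(64) vector_length(4096)
for(int i=0; i < 1000; ++i){
    some_worker(i);
}
//what the runtime actually launches on this target:
//  gang   -> CUDA block        : grid  [1-1000]
//  vector -> block x-dimension : capped at 32 (reduction support in routines)
//  worker -> block y-dimension : reduced to 24 here (presumably per-block
//                                resource limits)
//  i.e. block: [1-32x1-24], as reported by NV_ACC_TIME above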
Note: to see the max thread block size on your device, run the "nvaccelinfo" utility and look for the "Maximum Threads per Block" and "Maximum Block Dimensions" fields. Also, setting the environment variable "NV_ACC_TIME=1" will have the runtime produce some basic profiling information, including the actual number of blocks and thread block size used during the run.
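If you'd rather query those limits from code than via nvaccelinfo, the CUDA runtime API reports the same numbers (this is plain CUDA, not OpenACC, and just one way to do it):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   //device 0
    printf("Maximum Threads per Block: %d\n", prop.maxThreadsPerBlock);
    printf("Maximum Block Dimensions : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}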
In my understanding, a huge amount of threads should lie dormant when there are only as many as 10 iterations in the vector loop.
CUDA threads are grouped into "warps" of 32 threads, where all threads of a warp execute the same instructions concurrently (aka SIMT, single instruction multiple threads). Hence, even though only 10 threads are doing useful work, the remaining 22 are not dormant. Plus, they still take resources such as registers, so adding too many threads for loops with low trip counts may actually hurt performance.
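For your first vector loop that means (the loop is unchanged, only the annotation is mine):

//with a 32-thread warp in the x-dimension but a trip count of 10:
#pragma acc loop vector reduction(+: private_out)
for(int j=0; j < 10; ++j){        //threads 0-9 do useful work; the other 22
    private_out -= some_array[j]; //still run with the warp and hold registers
}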
In this case, setting the vector length to 1 is most likely the best case since the warp can now be comprised of the y-dimension threads. Setting it to 2 will cause a full 32-thread warp in the x-dimension, but with only 2 threads doing useful work.
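Performance-wise, that suggests a schedule along these lines, which matches the fastest timing you reported (a sketch; whether it also gives correct results is the separate issue discussed below):

#pragma acc parallel loop gang num_workers(64) vector_length(1)
for(int i=0; i < 1000; ++i){
    some_worker(i);
}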
As to why some combinations give incorrect results, I didn't investigate. Routine worker, especially with reductions, is rarely used, so it's possible we have some type of code generation issue, like an off-by-one error in the reduction, at these irregular schedule sizes. I'll look into this later and determine if I need to file an issue report.
For #2, how are you determining that it's getting run twice? Is this just based on the runtime?
Upvotes: 2