user10172900

Is there a way to customize the default parallelization behavior of whole-array statements in Chapel?

According to the available documentation for Chapel, (whole-)array statements like

A = B + alpha * C;   // with A, B, and C being arrays, and alpha some scalar

are implemented in the language as the following forall iteration:

forall (a,b,c) in zip(A,B,C) do
   a = b + alpha * c;

Thus, array statements are by default executed by a team of parallel threads. Unfortunately, this also seems to preclude any (even partial) vectorization of such statements. This can lead to performance surprises for programmers used to languages like Fortran or Python/NumPy, where the default behavior is typically for array statements to be vectorized only.

For codes that use (whole-)array statements on arrays of small to moderate size, the loss of vectorization (confirmed with Linux hardware performance counters), together with the overhead inherent in parallel threads (which are ill-suited to exploiting the fine-grained data parallelism available in such problems), can result in a significant loss of performance. As an example, consider the following versions of Jacobi iteration, all of which solve the same problem on a domain of 300 x 300 zones:

Jacobi_1 employs array-statements, as follows:

/*
 *  Jacobi_1
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (whole-)array statements.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  X[n+1,1..n] = 1.0;                 // Set south boundary values to 1.0

  do {

    // compute next approximation
    XNew[1..n,1..n] =
      ( X[0..n-1,1..n] + X[2..n+1,1..n] +
        X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n,1..n] = XNew[1..n,1..n];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_2 employs serial for-loops throughout (i.e. only (auto-)vectorization by the back-end C-compiler is allowed):

/*
 *  Jacobi_2
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (serial) for-loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in 1..n do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    for i in 1..n do
      for j in 1..n do  
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_3, finally, has the innermost loops vectorized and only the outermost loops threaded:

/*
 *  Jacobi_3
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in vectorizeOnly(1..n) do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Running these codes on a laptop with 2 processor-cores and using two parallel threads, one finds that Jacobi_1 is (surprisingly) more than ten times slower than Jacobi_2, which itself is (expectedly) ~1.6 times slower than Jacobi_3.

Unfortunately, this default behavior makes array statements completely unattractive for my use cases, even for algorithms that would benefit enormously from the more concise notation and readability that (whole-)array statements can provide.

Are there ways for the user in Chapel to change this default behavior? That is, can a user customize the default parallelization of whole-array statements so that the array statements used in Jacobi_1 behave either like the code in Jacobi_2 (which would be useful for code development and debugging) or like the code in Jacobi_3 (which, of the three, would be the method of choice for production calculations)?

I have tried to achieve this by plugging calls to "vectorizeOnly()" into the definition of "Domain" above, but to no avail.

Upvotes: 5

Views: 169

Answers (1)

Brad

Reputation: 4008

Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet that capability is not well-supported today, as you note (even the vectorizeOnly() iterator that you're using is only considered prototypical).

I'll mention that we tend to see better vectorization results when using Chapel's LLVM back-end than with the (default) C back-end, and even better results when utilizing Simon Moll's LLVM-based Region Vectorizer (Saarland University). That said, we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary. Still, if you care about vectorization, it's worth a try.

To your specific question:

Are there ways for the user in Chapel to change this default behavior?

There are. For explicit forall loops, you can write your own parallel iterator which can be used to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) to govern how loops over a given array are implemented by default (i.e., if no iterator is explicitly invoked). This permits end-users to specify different implementation policies for a Chapel array than the ones we support by default.
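
As a rough illustration (this is only a sketch, not code from Chapel's modules; the iterator name rowwise and the simple manual chunking are just for demonstration), a standalone parallel iterator that parallelizes only across rows and walks each row serially might look like this:

// Serial overload, used when the iterator drives a plain 'for' loop
iter rowwise(rows: range, cols: range) {
  for i in rows do
    for j in cols do
      yield (i, j);
}

// Standalone parallel overload, used when the iterator drives a 'forall' loop
iter rowwise(param tag: iterKind, rows: range, cols: range)
    where tag == iterKind.standalone {
  const nTasks = min(here.maxTaskPar, rows.size);
  coforall tid in 0..#nTasks {
    // give each task a contiguous block of rows
    const lo = rows.low + (rows.size * tid) / nTasks;
    const hi = rows.low + (rows.size * (tid + 1)) / nTasks - 1;
    for i in lo..hi do
      for j in cols do         // serial inner loop, left for the back-end to vectorize
        yield (i, j);
  }
}

Usage would then be, e.g.:

forall (i,j) in rowwise(1..n, 1..n) do
  XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                X[i,j+1] + X[i,j-1] ) / 4.0;

Note that zippered forall loops (like the ones generated for whole-array statements) dispatch to leader/follower iterators rather than the standalone one, and making such a policy the default for loops over a given array is what the domain-map route above is for.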

With respect to your three code variants, I note that the first uses multidimensional zippered iteration, which is known to have significant performance problems today. That is the likely main cause of the performance difference between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then indexed with +/-1 offsets in each dimension, you'd see a significant improvement (and, I'd guess, performance that's much more comparable to the third case).
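
As a concrete sketch of that rewrite (the Interior domain name is illustrative, and I haven't benchmarked this exact variant), the update would become:

const Interior = {1..n, 1..n};     // interior (non-boundary) indices

forall (i,j) in Interior do
  XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                X[i,j+1] + X[i,j-1] ) / 4.0;

This iterates directly over the interior indices and avoids zippering several multidimensional array slices together.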

For the third, I'd be curious whether the benefits you're seeing are due to vectorization or simply to multitasking, since you've avoided the performance problem of the first and the serial execution of the second. E.g., have you checked whether using the vectorizeOnly() iterator adds any performance improvement over the same code without that iterator (or used tools on the binary files to inspect whether vectorization is occurring)?
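
For instance (just a sketch of the comparison I have in mind, with no claim about the outcome), you could time the Jacobi_3 kernel against the same kernel with a plain serial inner loop:

// Same update as in Jacobi_3, but without vectorizeOnly()
forall i in 1..n do
  for j in 1..n do
    XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                  X[i,j+1] + X[i,j-1] ) / 4.0;

If the two versions run at essentially the same speed, the speedup is coming from the loop structure and multitasking rather than from vectorization.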

In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for best vectorization results, you might try the LLVM back-end.

Upvotes: 2
