According to the available documentation for Chapel, (whole-)array statements like
A = B + alpha * C; // with A, B, and C being arrays, and alpha some scalar
are implemented in the language as the following forall iteration:
forall (a,b,c) in zip(A,B,C) do
  a = b + alpha * c;
Thus, array statements are by default executed by a team of parallel threads. Unfortunately, this also seems to preclude any (partial or complete) vectorization of such statements. This can lead to performance surprises for programmers used to languages like Fortran or Python/NumPy, where the default behavior is typically to vectorize array statements rather than spread them across threads.
For codes that use (whole-)array statements on arrays of small to moderate size, the loss of vectorization (confirmed by Linux hardware performance counters) and the significant overhead of parallel threads (which are ill-suited to exploiting the fine-grained data parallelism available in such problems) can cause a substantial loss of performance. As an example, consider the following three versions of a Jacobi iteration that all solve the same problem on a domain of 300 x 300 zones:
Jacobi_1 employs (whole-)array statements, as follows:
/*
* Jacobi_1
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (whole-)array statements.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  X[n+1, 1..n] = 1.0;               // Set south boundary values to 1.0

  do {
    // compute next approximation
    XNew[1..n, 1..n] =
      ( X[0..n-1, 1..n] + X[2..n+1, 1..n] +
        X[1..n, 2..n+1] + X[1..n, 0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n, 1..n] = XNew[1..n, 1..n];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Jacobi_2 employs serial for-loops throughout (i.e., only (auto-)vectorization by the back-end C compiler is allowed):
/*
* Jacobi_2
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (serial) for-loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  for j in 1..n do
    X[n+1, j] = 1.0;                // Set south boundary values to 1.0

  do {
    // compute next approximation
    for i in 1..n do
      for j in 1..n do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Jacobi_3, finally, has the innermost loops vectorized and only the outermost loops threaded:
/*
* Jacobi_3
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using both parallel and serial (vectorized) loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  for j in vectorizeOnly(1..n) do
    X[n+1, j] = 1.0;                // Set south boundary values to 1.0

  do {
    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Running these codes on a laptop with 2 processor cores and using two parallel threads, one finds that Jacobi_1 is (surprisingly) more than ten times slower than Jacobi_2, which itself is (expectedly) ~1.6 times slower than Jacobi_3.
Unfortunately, this default behavior makes array statements completely unattractive for my use cases, even for algorithms that would benefit enormously from the more concise notation and better readability that (whole-)array statements can provide.
Are there ways for the user in Chapel to change this default behavior? That is, can a user customize the default parallelization of whole-array statements so that array statements such as those in Jacobi_1 behave either like the code in Jacobi_2 (which would be useful for code development and debugging purposes) or like the code in Jacobi_3 (which, of the three, would be the method of choice for production calculations)?
I have tried to achieve this by plugging calls to vectorizeOnly() into the definition of Domain above, but to no avail.
Upvotes: 5
Views: 169
Reputation: 4008
Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet that capability is not well-supported today, as you note (even the vectorizeOnly() iterator that you're using is only considered prototypical).
I'll mention that we tend to see better vectorization results when using Chapel's LLVM back-end than with the (default) C back-end, and even better results when utilizing Simon Moll's LLVM-based Region Vectorizer (Saarland University). That said, we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary; still, if you care about vectorization, it's worth a try.
To your specific question:
Are there ways for the user in Chapel to change this default behavior?
There are. For explicit forall loops, you can write your own parallel iterator, which can be used to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) to govern how loops over a given array are implemented by default (i.e., when no iterator is explicitly invoked). This permits end-users to specify different implementation policies for a Chapel array than the ones we support by default.
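As a concrete illustration, here is a minimal sketch (mine, not taken from the Chapel sources or documentation) of a user-defined standalone parallel iterator that blocks a unit-stride range across the available tasks and keeps each task's inner loop serial, leaving it eligible for back-end (auto-)vectorization; the name blockedIter and the chunking arithmetic are my own choices:

// serial overload, so the iterator also works in plain for-loops
iter blockedIter(r: range) {
  for i in r do
    yield i;
}

// standalone parallel overload, invoked by forall loops
iter blockedIter(param tag: iterKind, r: range)
  where tag == iterKind.standalone {
  const numTasks = here.maxTaskPar,
        len = r.size;
  coforall tid in 0..#numTasks {
    // compute this task's contiguous block of indices by hand
    const lo = r.low + (tid * len) / numTasks,
          hi = r.low + ((tid + 1) * len) / numTasks - 1;
    for i in lo..hi do   // serial per-task loop, a candidate for vectorization
      yield i;
  }
}

// usage: drive the outer loop in parallel, keep the inner loop serial
forall i in blockedIter(1..n) do
  for j in 1..n do
    XNew[i,j] = ( X[i-1,j] + X[i+1,j] + X[i,j+1] + X[i,j-1] ) / 4.0;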
With respect to your three code variants, I note that the first uses multidimensional zippering, which is known to have significant performance problems today. This is the likely main cause of the performance differences between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then used +/-1 indexing per dimension, you'd see a significant improvement (and, I'd guess, performance much more comparable to the third case).
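For concreteness, here is a sketch (mine, not from the question) of what that rewrite of the Jacobi_1 stencil update could look like; restricting the loop to an interior domain, and the name Interior itself, are assumptions on my part:

const Interior = {1..n, 1..n};   // interior points only, so the +/-1 offsets stay in bounds
forall (i,j) in Interior do
  XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                X[i,j+1] + X[i,j-1] ) / 4.0;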
For the third, I'd be curious whether the benefits you're seeing are due to vectorization or simply to multitasking, since you've avoided the performance problem of the first variant and the serial execution of the second. E.g., have you checked whether using the vectorizeOnly() iterator gives any performance improvement over the same code without it (or used tools on the binary to inspect whether vectorization is actually occurring)?
In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for best vectorization results, you might try the LLVM back-end.
Upvotes: 2