ThomasR

Reputation: 33

Array slicing performance in Chapel

I have a piece of C code that looks like the following:

for(int i = 0; i < numRows; i++) {
   double *myRow = matrixPtr + (i * numCols);
   for (int j = 0; j < numCols; j++) {
      someOperation(myRow[j]);
   }
}

Here matrixPtr points to a 2D matrix stored in row-major layout. Taking a pointer to each row makes the code a bit more readable and avoids recomputing the row offset for each access in the innermost loop (i.e., matrixPtr[(i*numCols)+j]).

In Chapel, if I were to translate the above code, trying to match it closely, I could have something like this:

for i in 0..numRows-1 {
   ref myRow = matrix[i,..];
   for j in 0..numCols-1 {
      someOperation(myRow[j]);
   }
}

Here matrix is a 2D Chapel array and myRow is a reference to a slice of one of its rows.

I noticed that the performance of the above Chapel code is very slow compared to omitting the array-slicing step and directly indexing matrix with [i,j]. I assume the direct version closely resembles the C code above after compiler optimizations; in fact, its performance is virtually identical to the C code:

for i in 0..numRows-1 {
   for j in 0..numCols-1 {
      someOperation(matrix[i,j]);
   }
}

I can understand why it would be slower, since slicing has to happen once per row. But what I'd like to know is: why is array slicing in Chapel so much slower? To be up front, my Chapel experience is minimal at best.

I attempted to look at the generated C code to better understand what is happening. I first thought there was a lot of overhead associated with building a bounded range each time the row reference is created. To test that, I created the range up front and used that (i.e. ref myRow = matrix[i,colRange]). However, that didn't seem to improve the runtime. The only other piece in the generated C code that I could isolate as a potential clue is a function that changes the rank of an array (or something along those lines).

For perspective, this type of matrix operation is performed many times in an application I have, where numRows is very large and numCols is very small by comparison, and the performance of the Chapel code when using array slicing is 30-40x slower than directly accessing the matrix with [i,j] (with the --fast flag used during compilation).

Thanks

UPDATE:

Here are two small programs that reproduce the reported slowdown (somewhere between 20x and 25x) when compiled with the --fast flag. The direct.chpl program has execution time identical to a version that gets a c_ptr to the matrix and computes the row offset by hand, as described above.

slices.chpl

use Time;
var numRows = 100000;
var numCols = 35;
var D : domain(2) = {0..numRows-1, 0..numCols-1};
var mat : [D] real;
mat = 1.0;
var totalTimer : Timer;
var accum : [0..numCols-1] real;
accum = 0.0;

totalTimer.start();
for i in 0..numRows-1 {
    ref myRow = mat(i,..);
    for j in 0..numCols-1 {
        accum[j] += i * myRow[j];
    }
}

totalTimer.stop();
writeln("Accum:");
writeln(accum);
writeln("\nTotal Elapsed time: ", totalTimer.elapsed(), " seconds");

direct.chpl:

use Time;
var numRows = 100000;
var numCols = 35;
var D : domain(2) = {0..numRows-1, 0..numCols-1};
var mat : [D] real;
mat = 1.0;
var totalTimer : Timer;
var accum : [0..numCols-1] real;
accum = 0.0;

totalTimer.start();
for i in 0..numRows-1 {
    for j in 0..numCols-1 {
        accum[j] += i * mat[i,j];
    }
}

totalTimer.stop();
writeln("Accum:");
writeln(accum);
writeln("\nTotal Elapsed time: ", totalTimer.elapsed(), " seconds");

Output for slices.chpl:

Accum:
4.99995e+09 4.99995e+09 4.99995e+09 4.99995e+09...

Total Elapsed time: 0.124494 seconds

Output for direct.chpl:

Accum:
4.99995e+09 4.99995e+09 4.99995e+09 4.99995e+09...

Total Elapsed time: 0.005211 seconds

So that seems to be a 23x difference in runtime. That isn't exactly the 30-40x difference I see in my real application, but it's certainly more than I would have expected.

Upvotes: 3

Views: 313

Answers (1)

Brad

Reputation: 4008

what I'd like to know is why array slicing in Chapel is so much slower [than my C code]?

I think that the answer here is that Chapel's array slices are designed to support all of the operations that an array supports, including bounds checks and queries, iteration, being passed to routines taking array arguments, reindexing, subsequent slicing, etc. That support requires setting up metadata to describe both the array and its domain (index set), which in this case is computed by intersecting matrix's domain with {i, ..}. So by nature it's more work than the pointer math in your C code, as you correctly anticipate.

That said, your report of 30-40x slower was surprising to us. A colleague and I have tried to reproduce your experiment and saw overheads that were more in the 2x ballpark (my results were gathered on a Mac laptop). This led us to wonder what could be different in your case. My best guess is that something about your program is reaching across locale boundaries to remote memory, e.g. by slicing a remote or distributed array. In any case, if you are able to share code that reproduces this behavior, I'd suggest opening a GitHub issue referencing this question, with instructions on how the data was gathered, to support exploring it further.

Generally speaking, we tend not to encourage using array views (array slices and reindexing) in performance-critical code, and particularly not as a means of trying to reproduce scalar C-style optimizations. Our preference is to have users write the code in the cleaner / more direct style and rely on the Chapel compiler to optimize it (and/or let us know when you find cases where we're leaving performance on the floor). Chapel's array views are designed primarily as a productivity feature, for example to support the creation of a view of an array or subarray in order to meet the formal argument requirements of an existing routine.

Upvotes: 2
