Reputation: 3184
Consider the following functions, which store values into a matrix row-wise and column-wise, respectively.
#include <Rcpp.h>
using namespace Rcpp;

const int m = 10000;
const int n = 3;

// [[Rcpp::export]]
SEXP rowWise() {
  SEXP A = Rf_allocMatrix(INTSXP, m, n);
  int* p = INTEGER(A);
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      p[m * j + i] = j;
    }
  }
  return A;
}

// [[Rcpp::export]]
SEXP columnWise() {
  SEXP A = Rf_allocMatrix(INTSXP, n, m);
  int* p = INTEGER(A);
  for (int j = 0; j < m; j++) {
    for (int i = 0; i < n; i++) {
      p[n * j + i] = i;
    }
  }
  return A;
}
/*** R
library(microbenchmark)
gc()
microbenchmark(
  rowWise(),
  columnWise(),
  times = 1000
)
*/
The above code yields
Unit: microseconds
         expr    min     lq     mean  median      uq       max neval
    rowWise() 12.524 18.631 64.24991 20.4540 24.8385 10894.353  1000
 columnWise() 11.803 19.434 40.08047 20.9005 24.1585  8590.663  1000
Accessing values row-wise is faster (or at least no slower) than accessing them column-wise, which is counter-intuitive to what I believed. However, the result seems to depend, almost magically, on the values of m and n. So I guess my question is: why is columnWise not much faster than rowWise?
Upvotes: 7
Views: 1336
Reputation: 73385
The dimension (shape) of the matrix has an impact.

When we do a row-wise scan of a 10000 x 3 integer matrix A, the cache can still be used effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache-line boundary.
--------------------------------------
A[1, 1] A[1, 2] A[1, 3] M M M
A[2, 1] A[2, 2] A[2, 3] H H H
. . . . . .
. . . . . .
A[16,1] A[16,2] A[16,3] H H H
--------------------------------------
A[17,1] A[17,2] A[17,3] M M M
A[18,1] A[18,2] A[18,3] H H H
. . . . . .
. . . . . .
A[32,1] A[32,2] A[32,3] H H H
--------------------------------------
A[33,1] A[33,2] A[33,3] M M M
A[34,1] A[34,2] A[34,3] H H H
. . . . . .
. . . . . .
A 64-byte cache line can hold 16 (4-byte) integers. When we access A[1, 1], a full cache line is filled, that is, A[1, 1] to A[16, 1] are all loaded into cache. When we scan a row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 block is loaded into cache, which is much smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, when we start to scan the 2nd row we get a cache hit (H) for every element. So we have a periodic pattern:
[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...
That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio of scanning A column-wise, where we have the following periodic pattern:
[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...
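To make the arithmetic concrete, here is a minimal, self-contained simulation of this idealized model. Everything in it is an illustrative assumption rather than a measured hardware parameter: a single fully associative cache of 32 KB with 64-byte lines and LRU replacement. It counts misses for a row-wise and a column-wise scan of a column-major 10000 x 3 matrix:

#include <cstdio>
#include <list>

// Idealized fully associative LRU cache: 32 KB capacity, 64-byte lines.
// (Both numbers are assumptions for illustration only.)
struct Cache {
    static const int LINE  = 16;              // 16 ints per 64-byte line
    static const int LINES = 32 * 1024 / 64;  // 512 lines in total
    std::list<long> lru;                      // front = most recently used
    long misses = 0, accesses = 0;

    void access(long index) {
        ++accesses;
        long line = index / LINE;
        for (auto it = lru.begin(); it != lru.end(); ++it)
            if (*it == line) {                // hit: move line to the front
                lru.erase(it);
                lru.push_front(line);
                return;
            }
        ++misses;                             // miss: fetch the line
        lru.push_front(line);
        if ((int)lru.size() > LINES) lru.pop_back();  // evict LRU line
    }
};

int main() {
    const long m = 10000, n = 3;              // column-major m x n matrix
    Cache row, col;
    for (long i = 0; i < m; i++)              // row-wise scan
        for (long j = 0; j < n; j++) row.access(m * j + i);
    for (long j = 0; j < n; j++)              // column-wise scan
        for (long i = 0; i < m; i++) col.access(m * j + i);
    std::printf("row-wise    miss ratio: %.2f%%\n", 100.0 * row.misses / row.accesses);
    std::printf("column-wise miss ratio: %.2f%%\n", 100.0 * col.misses / col.accesses);
}

Under this model each cache line is missed exactly once in either traversal, so both scans come out at the same 1 / 16 = 6.25% miss ratio.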
Now try a 5000 x 5000 matrix. After reading the first row, 16 x 5000 elements have been fetched into cache (16 x 5000 x 4 bytes, roughly 312 KB), far more than the 32 KB capacity, so cache eviction has kicked out A[1, 1] to A[16, 1] (most caches apply a "least recently used" cache-line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan only has a cache miss ratio of 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.
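To verify this yourself, you can parametrize the two functions so that the shape is an argument. The following is a minimal sketch rather than the original code: the names rowWise2 and columnWise2 are my own, and both fill the same m x n matrix, differing only in traversal order.

#include <Rcpp.h>
using namespace Rcpp;

// Same fill loops as above, but with the dimensions passed in,
// so the 5000 x 5000 case can be benchmarked directly.

// [[Rcpp::export]]
SEXP rowWise2(int m, int n) {
  SEXP A = PROTECT(Rf_allocMatrix(INTSXP, m, n));
  int* p = INTEGER(A);
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      p[m * j + i] = j;            // stride-m (non-sequential) writes
  UNPROTECT(1);
  return A;
}

// [[Rcpp::export]]
SEXP columnWise2(int m, int n) {
  SEXP A = PROTECT(Rf_allocMatrix(INTSXP, m, n));
  int* p = INTEGER(A);
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      p[m * j + i] = i;            // sequential writes
  UNPROTECT(1);
  return A;
}

/*** R
library(microbenchmark)
microbenchmark(rowWise2(5000, 5000), columnWise2(5000, 5000), times = 100)
*/

With m = n = 5000, the row-wise version should come out noticeably slower, for exactly the eviction reason described above.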
In summary, with a 10000 x 3 matrix, we get the same cache performance whether we scan it by row or by column. I don't see from the median times reported by microbenchmark that rowWise is faster than columnWise. Their execution times may not be exactly equal, but the difference is too small to be of concern.
"For a 5000 x 5000 matrix, rowWise is much slower than columnWise."

Thanks for verification.
The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency, but don't interpret it too narrowly.
In fact, if you treat the three columns of A as three vectors x, y, z, and consider the element-wise addition (i.e., a row-wise sum across A) z[i] = x[i] + y[i], are we not accessing all three vectors sequentially? Doesn't this fall under the "golden rule"? Scanning a 10000 x 3 matrix by row is no different from alternately reading three vectors sequentially, and that is very efficient, as the sketch below shows.
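Here is a minimal sketch of that access pattern (the name addColumns is hypothetical): it reads the first two columns of a column-major IntegerMatrix as x and y and writes their element-wise sum into a new vector z.

#include <Rcpp.h>
using namespace Rcpp;

// Treat columns 1 and 2 of A as vectors x and y, and compute
// z[i] = x[i] + y[i]. A is assumed to have at least two columns.
// [[Rcpp::export]]
IntegerVector addColumns(IntegerMatrix A) {
  int m = A.nrow();
  const int* x = A.begin();        // column 1 of the column-major storage
  const int* y = A.begin() + m;    // column 2
  IntegerVector z(m);
  for (int i = 0; i < m; i++)
    z[i] = x[i] + y[i];            // three sequential streams
  return z;
}

Each iteration touches three sequential streams (two reads, one write), which is exactly the kind of access pattern the cache handles well.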
Upvotes: 12