Reputation: 13
EDIT:
ICC (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(108,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(109,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
It seems that the C[i][j] is read before it is written if vectorized (as I am doing reduction). The question is why the reduction is allowed is a local variable is introduced (temp)?
Original issue:
I have a C snippet below which does matrix multiplication. a, b - operands, c - a*b result. n - row&column length.
double ** c = create_matrix(...) // initialize n*n matrix with zeroes
double ** a = fill_matrix(...) // fills n*n matrix with random doubles
double ** b = fill_matrix(...) // fills n*n matrix with random doubles
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
The ICC (version 18.0.0.1) is not able to vectorize (provided -O3 flag) the inner loop.
ICC output:
LOOP BEGIN at 4.c(107,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(108,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(109,4)
remark #25460: No loop optimizations reported
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
Though, with changes below, the compiler vectorizes the inner loop.
// OLD
for (k = 0; k < n; k++) {
c[i][j] += a[i][k] * b[k][j];
}
// TO (NEW)
double tmp = 0;
for (k = 0; k < n; k++) {
tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC vectorized output:
LOOP BEGIN at 4.c(119,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(120,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(134,4)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at 4.c(134,4)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at 4.c(134,4)
<Alternate Alignment Vectorized Loop>
LOOP END
LOOP BEGIN at 4.c(134,4)
<Remainder loop for vectorization>
LOOP END
LOOP END
LOOP END
Instead of accumulating vector multiplication result in matrix C cell, the result is accumulated in a separate variable and assigned later.
Why does the compiler not optimize the first version? Could it be due to potential aliasing of a or / and b to c elements (Read after write problem)?
Upvotes: 0
Views: 988
Reputation: 61289
Leverage Your Compiler
You don't show the flags you're using to get your vectorization report. I recommend:
-qopt-report=5 -qopt-report-phase:vec
The documentation says:
For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.
With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.
My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.
A Portable Solution
If you're into OpenMP, you could use:
#pragma omp simd
for (k = 0; k < n; k++)
c[i][j] += a[i][k] * b[k][j];
to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.
A Miscellaneous Note
This line:
double ** c = create_matrix(...)
makes me nervous. It suggests that somewhere you have something like this:
for(int i=0;i<10;i++)
c[i] = new double[20];
That is, you have an array of arrays.
The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.
You might also consider compiling with the -align
flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.
Upvotes: 1