Lukas
Lukas

Reputation: 13

Intel compiler (ICC) unable to auto vectorize inner loop (matrix multiplication)

EDIT:

ICC (after adding -qopt-report=5 -qopt-report-phase:vec):

LOOP BEGIN at 4.c(107,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

   LOOP BEGIN at 4.c(108,3)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

      LOOP BEGIN at 4.c(109,4)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
         remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END

It seems that the C[i][j] is read before it is written if vectorized (as I am doing reduction). The question is why the reduction is allowed is a local variable is introduced (temp)?

Original issue:

I have a C snippet below which does matrix multiplication. a, b - operands, c - a*b result. n - row&column length.

double ** c = create_matrix(...) // initialize n*n matrix with zeroes
double ** a = fill_matrix(...) // fills n*n matrix with random doubles
double ** b = fill_matrix(...) // fills n*n matrix with random doubles

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The ICC (version 18.0.0.1) is not able to vectorize (provided -O3 flag) the inner loop.

ICC output:

LOOP BEGIN at 4.c(107,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(108,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(109,4)
         remark #25460: No loop optimizations reported
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END

Though, with changes below, the compiler vectorizes the inner loop.

// OLD
for (k = 0; k < n; k++) {
  c[i][j] += a[i][k] * b[k][j];
}

// TO (NEW)
double tmp = 0;

for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}

c[i][j] = tmp;

ICC vectorized output:

LOOP BEGIN at 4.c(119,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(120,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(134,4)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Alternate Alignment Vectorized Loop>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END

Instead of accumulating vector multiplication result in matrix C cell, the result is accumulated in a separate variable and assigned later.

Why does the compiler not optimize the first version? Could it be due to potential aliasing of a or / and b to c elements (Read after write problem)?

Upvotes: 0

Views: 988

Answers (1)

Richard
Richard

Reputation: 61289

Leverage Your Compiler

You don't show the flags you're using to get your vectorization report. I recommend:

-qopt-report=5 -qopt-report-phase:vec

The documentation says:

For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.

With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.

My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.

A Portable Solution

If you're into OpenMP, you could use:

#pragma omp simd
for (k = 0; k < n; k++)
  c[i][j] += a[i][k] * b[k][j];

to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.

A Miscellaneous Note

This line:

double ** c = create_matrix(...)

makes me nervous. It suggests that somewhere you have something like this:

for(int i=0;i<10;i++)
  c[i] = new double[20];

That is, you have an array of arrays.

The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.

You might also consider compiling with the -align flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.

Upvotes: 1

Related Questions