Intel compiler (ICC) unable to auto vectorize inner loop (matrix multiplication)

Question

EDIT:

ICC (after adding -qopt-report=5 -qopt-report-phase:vec):

LOOP BEGIN at 4.c(107,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

   LOOP BEGIN at 4.c(108,3)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

      LOOP BEGIN at 4.c(109,4)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
         remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      
      LOOP END
   LOOP END
LOOP END

It seems that c[i][j] would be read before it is written if the loop were vectorized (since I am doing a reduction). So the question is: why is the reduction allowed once a local variable (tmp) is introduced?
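To make the reported FLOW/ANTI dependence concrete, here is a minimal sketch (not taken from the compiler output) of the two ways the inner loop can be evaluated. The function names matmul_direct and matmul_local are hypothetical, and matmul_local starts tmp from c[i][j] instead of 0 purely so that both functions are equivalent whenever c does not overlap a or b. The point is that a local variable whose address is never taken cannot be aliased, while a store to c[i][j] could, as far as the compiler can prove, change a value that a later iteration loads.

```c
#include <stddef.h>

/* Same update as the loop in the question: every iteration of k loads
 * c[i][j], adds one product, and stores the sum back to memory. */
void matmul_direct(size_t n, double **a, double **b, double **c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Register-style variant: the running sum lives in a local whose address
 * is never taken, so no pointer can alias it. It starts from c[i][j]
 * (an assumption made for this sketch) so that both functions agree
 * whenever c does not overlap a or b. */
void matmul_local(size_t n, double **a, double **b, double **c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double tmp = c[i][j];
            for (size_t k = 0; k < n; k++)
                tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
        }
}
```

With distinct matrices the two functions produce the same result. If the same row pointers are passed as both a and c (for example a = {{1, 2}, {3, 4}} and b filled with ones), matmul_direct yields {{4, 12}, {10, 28}} while matmul_local yields {{4, 8}, {10, 18}}, so the compiler cannot silently turn the first form into the second unless it can rule out the overlap.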

Original issue:

I have a C snippet below that performs matrix multiplication: a and b are the operands, c holds the result a*b, and n is the number of rows and columns.

double **c = create_matrix(...); // n*n matrix initialized with zeroes
double **a = fill_matrix(...);   // n*n matrix filled with random doubles
double **b = fill_matrix(...);   // n*n matrix filled with random doubles

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

ICC (version 18.0.0.1) is not able to vectorize the inner loop, even with the -O3 flag.

ICC output:

LOOP BEGIN at 4.c(107,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(108,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(109,4)
         remark #25460: No loop optimizations reported
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      
      LOOP END
   LOOP END
LOOP END

However, with the change below, the compiler does vectorize the inner loop.

// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}

// TO (NEW)
double tmp = 0;

for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}

c[i][j] = tmp;

ICC vectorized output:

LOOP BEGIN at 4.c(119,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(120,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(134,4)
      
      LOOP END

      LOOP BEGIN at 4.c(134,4)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      
      LOOP END
   LOOP END
LOOP END

Instead of accumulating the products directly in the c[i][j] cell of matrix c, the result is accumulated in a separate local variable and assigned to c[i][j] afterwards.

Why does the compiler not optimize the first version? Could it be due to potential aliasing of elements of a and/or b with elements of c (a read-after-write hazard)?
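One way to probe the aliasing hypothesis, sketched here under assumptions, is to promise the compiler that the output row does not overlap anything else by using the C99 restrict qualifier. The function name matmul_restrict_row is hypothetical, and I have not verified that ICC 18 vectorizes this form; depending on the language mode, ICC may also need -std=c99 or its -restrict option to accept the keyword.

```c
#include <stddef.h>

/* Hypothetical variant of the loop in the question. The only change is
 * that row i of c is accessed through a restrict-qualified pointer. */
void matmul_restrict_row(size_t n, double **a, double **b, double **c)
{
    for (size_t i = 0; i < n; i++) {
        /* Promise to the compiler: inside this block, the elements written
         * through ci are accessed only through ci, so they cannot overlap
         * a[i][k], the row pointers b[k], or b[k][j]. Breaking this promise
         * is undefined behavior. */
        double *restrict ci = c[i];

        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                ci[j] += a[i][k] * b[k][j];
    }
}
```

If the vectorization report changes with this version, that would support the idea that assumed aliasing between c and the inputs is what blocks the reduction in the original loop.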

