user1235183
user1235183

Reputation: 3072

dot product using cblas is slow

I want to calculate the product A^T*A ( A is 2000x1000 Matrix). Also i only want to solve the upper triangular Matrix. In the inner loop i have to solve the dot product of two vectors.

Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM)i7 CPU M620 @2,67GHz, 1,92GB RAM)

Upvotes: 2

Views: 4460

Answers (3)

himeshi
himeshi

Reputation: 485

If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.

void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
    int i, j, k;
    MKL_INT incx, incy;

    incx = 1;
    incy = b_size;

    //copy out multiplying matrix from larger matrix
    float *temp = (float*) malloc(n * n * sizeof(float));
    for (i = 0; i < n; ++i) {
        cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
    }

    //transpose
    mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);

    for (i = 0; i < n; i+= BLOCK_SIZE) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < BLOCK_SIZE; ++k) {
                C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
            }
        }
    }
    free(temp);
}

On my machine, this code is about 1 order of magnitude faster than the the 3 loop code (but also 1 order of magnitude slower than cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL).

Upvotes: 0

Wolfgang Jansen
Wolfgang Jansen

Reputation: 101

The problem is caused essentially by matrix size, not by ddot. Your matrices are so large that they do not fit in the cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, so reducing cache refreshes. A model implementation follows for both the ddot and an daxpy approach. On my computer the time consumption was about 15:1. In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.

    /*
      Matrix product of A^T * A by two methods.
      1) "Row times column" as we learned in school.
      2) With rearranged loops such that need for cash refreshes is reduced
         (this can be improved even more).

      Compile:  gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
    */

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    #define ROWS 2000
    #define COLS 1000

    static double a[ROWS][COLS];
    static double c[COLS][COLS];

    static void dot() {
      int i, j;
      double *ai, *bj;
      ai = a[0];
      for (i=0; i<COLS; i++) {
        bj = a[0];
        for (j=0; j<COLS; j++) {
          c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
          bj += 1;
        }
        ai += 1;
      }
    }

    static void axpy() {
      int i, j;
      double *ci, *bj, aij;
      for (i=0; i<COLS; i++) {
        ci = c[i];
        for (j=0; j<COLS; j++) ci[j] = 0.;
        for (j=0; j<ROWS; j++) {
          aij = a[j][i];
          bj = a[j];
          cblas_daxpy(COLS,aij,bj,1,ci,1);
        }
      }
    }

    int main(int argc, char** argv) {
      clock_t t0, t1;
      int i, j;

      for (i=0; i<ROWS; ++i) 
        for (j=0; j<COLS; ++j)
          a[i][j] = i+j;

      t0 = clock();
      dot();
      t0 = clock();
      printf("Time for DOT : %f sec.\n",(double)t0/CLOCKS_PER_SEC);
      axpy();
      t1 = clock();
      printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);

      return 0;
    }

Upvotes: 4

talonmies
talonmies

Reputation: 72343

The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:

     DO I = MP1,N,5
      DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
 $            DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
     END DO

ie. just a loop unrolled to a stride of 5.

If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:

#include <emmintrin.h>

double ddotsse2(const double *x, const double *y, const int n)
{
    double result[2];
    int n2 = 2 * (n/2);
    __m128d dtemp;

    if ( (n % 2) == 0) {
        dtemp = _mm_setzero_pd(); 
    }  else {
        dtemp = _mm_set_sd(x[n] * y[n]);
    }

    for(int i=0; i<n2; i+=2) {
        __m128d x1 = _mm_loadr_pd(x+i);
        __m128d y1 = _mm_loadr_pd(y+i);
        __m128d xy = _mm_mul_pd(x1, y1);
        dtemp = _mm_add_pd(dtemp, xy);
    }

    _mm_store_pd(&result[0],dtemp);

    return result[0] + result[1];
}

(not tested, never been compiled, buyer beware).

This may or may be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.

Upvotes: 1

Related Questions