Jakub Pawłowski
Jakub Pawłowski

Reputation: 23

Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication

I am developing a program, making heavy use of Armadillo library. I have the 10.8.2 version, linked against Intel oneAPI MKL 2022.0.2. At some point, I need to perform many sparse matrix times dense vector multiplications, both of which are defined using Armadillo structures. I have found this point to be a probable bottleneck, and was being curious if replacing the Armadillo multiplication with "bare bones" sparse CBLAS routines from MKL (mkl_sparse_d_mv) would speed things up. But in order to do so, I need to convert from Armadillo's SpMat to something that MKL understands. As per Armadillo docs, sparse matrices are stored in CSC format, so I have tried mkl_sparse_d_create_csc. My attempt at this is below:

#include <iostream>
#include <armadillo>
#include "mkl.h"

int main()
{

    arma::umat locations = {{0, 0, 1, 3, 2},{0, 1, 0, 2, 3}};
    // arma::vec vals = {0.5, 2.5, 2.5, 4.5, 4.5};
    arma::vec vals = {0.5, 2.5, 3.5, 4.5, 5.5};
    arma::sp_mat X(locations, vals);
    std::cout << "X = \n" << arma::mat(X) << std::endl;
    arma::vec v = {1,1,1,1};
    arma::vec v2;
    v2.resize(4);

    std::cout << "v = \n" << v << std::endl;
    std::cout << "X * v = \n" << X * v << std::endl;

    MKL_INT *cols_beg = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *cols_end = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *row_idx = static_cast<MKL_INT *>(mkl_malloc(X.n_nonzero * sizeof(MKL_INT), 64));
    double *values = static_cast<double *>(mkl_malloc(X.n_nonzero * sizeof(double), 64));

    for (MKL_INT i = 0; i < X.n_cols; i++)
    {
        cols_beg[i] = static_cast<MKL_INT>(X.col_ptrs[i]);
        cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());
        std::cout << cols_beg[i] << " --- " << cols_end[i] << std::endl;
    }
    std::cout << std::endl;
    for (MKL_INT i = 0; i < X.n_nonzero; i++)
    {
        row_idx[i] = static_cast<MKL_INT>(X.row_indices[i]);
        values[i] = X.values[i];
        std::cout << row_idx[i] << " --- " << values[i] << std::endl;
    }
    std::cout << std::endl;
    sparse_matrix_t X_mkl = NULL;
    sparse_status_t res = mkl_sparse_d_create_csc(&X_mkl, SPARSE_INDEX_BASE_ZERO,
                                                  X.n_rows, X.n_cols, cols_beg, cols_end, row_idx, values);

    if(res == SPARSE_STATUS_SUCCESS) std::cout << "Constructed mkl representation of X" << std::endl;
    matrix_descr dsc;
    dsc.type = SPARSE_MATRIX_TYPE_GENERAL;

    sparse_status_t stat =  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, X_mkl, dsc, v.memptr(), 0.0, v2.memptr());
    std::cout << "Multiplication status = " << stat << std::endl;
    if(stat == SPARSE_STATUS_SUCCESS) 
    {
        std::cout << "Calculated X*v via mkl" << std::endl;
        std::cout << v2;
    }

    mkl_free(cols_beg);
    mkl_free(cols_end);
    mkl_free(row_idx);
    mkl_free(values);
    mkl_sparse_destroy(X_mkl);

    return 0;
}

I am compiling this code with (with the help of Link Line Advisor) icpc -g testing.cpp -o intel_testing.out -DARMA_ALLOW_FAKE_GCC -O3 -xhost -Wall -Wextra -L${MKLROOT}/lib/intel64 -liomp5 -lpthread -lm -DMKL_ILP64 -qmkl=parallel -larmadillo on Pop!_OS 21.10. It compiles and runs without any problems. The output is as follows:

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 1
2 --- 2
3 --- 3
4 --- 4

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   0.5000
        0
        0
        0

As we can see, the result of Armadillo's multiplication is correct, whereas the one from MKL is wrong. My question is this: Am I making a mistake somewhere? Or is there something wrong with MKL?. I suspect the former of course, but after spending considerable amount of time, cannot find anything. Any help would be much appreciated!

EDIT

As CJR and Vidyalatha_Intel suggested, I have changed col_end to

 cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

The result is now

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 2
2 --- 3
3 --- 4
4 --- 5

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   4.0000
   2.5000
        0
        0

col_end is indeed 2,3,4,5 as suggested, but the result is still wrong.

Upvotes: 2

Views: 1160

Answers (1)

Vidyalatha_Intel
Vidyalatha_Intel

Reputation: 462

Yes, the cols_end array is incorrect as pointed out by CJR. They should be indexed as 2,3,4,5. Please see the documentation regarding the parameter to the function mkl_sparse_d_create_csc

cols_end:

This array contains col indices, such that cols_end[i] - ind - 1 is the last index of col i in the arrays values and row_indx. ind takes 0 for zero-based indexing and 1 for one-based indexing.

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/inspector-executor-sparse-blas-routines/matrix-manipulation-routines/mkl-sparse-create-csc.html

Change this line

cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());

to

cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

Now recompile and run the code. I've tested it and it is showing the correct results. Image with results and compilation command enter image description here

Upvotes: 2

Related Questions