user3667089
user3667089

Reputation: 3298

Matrix multiplication in cuSparse (cusparseDcsrgemm) outputs wrong results

I am trying to compute A^TA using cuSparse. A is a large but sparse matrix. The problem is when I use the function cusparseDcsrgemm, the computed output is wrong. Please see the below minimal example to reproduce the problem.

CMakeLists.txt

cmake_minimum_required(VERSION 3.11)

project(sample)

find_package(CUDA REQUIRED)

add_executable(${PROJECT_NAME} main.cpp)

target_compile_features(${PROJECT_NAME} PUBLIC cxx_std_14)

target_include_directories(${PROJECT_NAME} SYSTEM PUBLIC ${CUDA_INCLUDE_DIRS})

target_link_libraries(${PROJECT_NAME} ${CUDA_LIBRARIES} ${CUDA_cusparse_LIBRARY})

main.cpp

#include <iostream>
#include <vector>

#include <cuda_runtime_api.h>
#include <cusparse_v2.h>

int main(){
  // 3x3 identity matrix in CSR format
  std::vector<int> row;
  std::vector<int> col;
  std::vector<double> val;

  row.emplace_back(0);
  row.emplace_back(1);
  row.emplace_back(2);
  row.emplace_back(3);

  col.emplace_back(0);
  col.emplace_back(1);
  col.emplace_back(2);

  val.emplace_back(1);
  val.emplace_back(1);
  val.emplace_back(1);

  int *d_row;
  int *d_col;
  double *d_val;

  int *d_out_row;
  int *d_out_col;
  double *d_out_val;

  cudaMalloc(reinterpret_cast<void **>(&d_row), row.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_col), col.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_val), val.size() * sizeof(double));

  // we know identity transpose times identity is still identity 
  cudaMalloc(reinterpret_cast<void **>(&d_out_row), row.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_out_col), col.size() * sizeof(int));
  cudaMalloc(reinterpret_cast<void **>(&d_out_val), val.size() * sizeof(double));

  cudaMemcpy(
      d_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(
      d_col, col.data(), sizeof(int) * col.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(
      d_val, val.data(), sizeof(double) * val.size(), cudaMemcpyHostToDevice);

  cusparseHandle_t handle;
  cusparseCreate(&handle);

  cusparseMatDescr_t descr;
  cusparseCreateMatDescr(&descr);
  cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

  cusparseMatDescr_t descr_out;
  cusparseCreateMatDescr(&descr_out);
  cusparseSetMatType(descr_out, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(descr_out, CUSPARSE_INDEX_BASE_ZERO);

  cusparseDcsrgemm(handle,
                   CUSPARSE_OPERATION_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                   3,
                   3,
                   3,
                   descr,
                   3,
                   d_val,
                   d_row,
                   d_col,
                   descr,
                   3,
                   d_val,
                   d_row,
                   d_col,
                   descr_out,
                   d_out_val,
                   d_out_row,
                   d_out_col);

  cudaMemcpy(
      row.data(), d_out_row, sizeof(int) * row.size(), cudaMemcpyDeviceToHost);
  cudaMemcpy(
      col.data(), d_out_col, sizeof(int) * col.size(), cudaMemcpyDeviceToHost);
  cudaMemcpy(
      val.data(), d_out_val, sizeof(double) * val.size(), cudaMemcpyDeviceToHost);

  std::cout << "row" << std::endl;
  for (int i : row)
  {
    std::cout << i << std::endl; //show 0 0 0 0, but it should be 0 1 2 3
  }

  std::cout << "col" << std::endl;
  for (int i : col)
  {
    std::cout << i << std::endl; //show 1 0 0, but it should be 0 1 2
  }

  std::cout << "val" << std::endl;
  for (int i : val)
  {
    std::cout << i << std::endl; //show 1 0 0, but it should be 1 1 1
  }

  return 0;
}

What am I doing wrong?

Upvotes: 1

Views: 568

Answers (1)

BlameTheBits
BlameTheBits

Reputation: 861

You simply forgot one step because you tried to make an easy example. In the documentation it is stated:

The cuSPARSE library adopts a two-step approach to complete sparse matrix. In the first step, the user allocates csrRowPtrC of m+1 elements and uses the function cusparseXcsrgemmNnz() to determine csrRowPtrC and the total number of nonzero elements.

What you did is to allocate m+1(m=3 in your example) elements for d_row_out and you determined the total number of nonzero elements which is 3 in your example. But you missed do "determine d_row_out" which means to fill the vector with the right values. In your simple example you could just add the line

cudaMemcpy(d_out_row, row.data(), sizeof(int) * row.size(), cudaMemcpyHostToDevice);

somewhere before your gemm call.

The more general approach of course would be to use the suggested function cusparseXcsrgemmNnz(). You could add the following lines somewhere before your gemm call (many values are still hardcoded as in your example, so it's not really general):

int nnz_check[1];
cusparseXcsrgemmNnz(handle,
                    CUSPARSE_OPERATION_TRANSPOSE,
                    CUSPARSE_OPERATION_NON_TRANSPOSE,
                    3,
                    3,
                    3,
                    descr,
                    3,
                    d_row,
                    d_col,
                    descr,
                    3,
                    d_row,
                    d_col,
                    descr_out,
                    d_out_row,  // the values this pointer points to will be set
                    nnz_check); // the number of nonzeros will also be calculated
assert(nnz_check[0] == 3);

Side note: The documentation says "[[DEPRECATED]] use cusparse<t>csrgemm2() instead. The routine will be removed in the next major release", that is version 11. The problem still remains for the second gemm version though as the same two-step approach is used.

Upvotes: 1

Related Questions