Reputation: 371
I am trying to perform a QR factorization on the GPU using the cuSOLVER library from CUDA.
I reduced my problem to the example below.
Basically, the steps are:
cusolverDnCreate
cusolverDnDgeqrf_bufferSize
cusolverDnDgeqrf
Unfortunately, the last call systematically fails, returning CUSOLVER_STATUS_EXECUTION_FAILED (int value = 6), and I can't figure out what went wrong!
Here is the faulty code:
#include <iostream>
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
int main(void)
{
int N = 5, P = 3;
double *hostData;
cudaMallocHost((void **) &hostData, N * sizeof(double));
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
double *devData;
cudaMalloc((void**)&devData, N * sizeof(double));
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
cusolverStatus_t retVal;
cusolverDnHandle_t solverHandle;
retVal = cusolverDnCreate(&solverHandle);
std::cout << "Handler creation : " << retVal << std::endl;
double *devTau, *work;
int szWork;
cudaMalloc((void**)&devTau, P * sizeof(double));
retVal = cusolverDnDgeqrf_bufferSize(solverHandle, N, P, devData, N, &szWork);
std::cout << "Work space sizing : " << retVal << std::endl;
cudaMalloc((void**)&work, szWork * sizeof(double));
int *devInfo;
cudaMalloc((void **)&devInfo, 1);
retVal = cusolverDnDgeqrf(solverHandle, N, P, devData, N, devTau, work, szWork, devInfo); //CUSOLVER_STATUS_EXECUTION_FAILED
std::cout << "QR factorization : " << retVal << std::endl;
int hDevInfo = 0;
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Info device : " << hDevInfo << std::endl;
cudaFree(devInfo);
cudaFree(work);
cudaFree(devTau);
cudaFree(devData);
cudaFreeHost(hostData);
cudaDeviceReset();
}
If you see any obvious error in my code, please let me know! Many thanks.
Upvotes: 1
Views: 1189
Reputation: 152279
Any time you are having trouble with a CUDA code, you should always use proper CUDA error checking and run your code with cuda-memcheck before asking for help.
You may also want to be aware that a fully worked QR factorization example is given in the relevant CUDA/cuSOLVER sample code, and there is also sample code in the documentation.
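As an illustration of what "proper error checking" can look like, here is a minimal sketch, not the canonical CUDA idiom. The helper `checkStatus` and the macro `CHECK_STATUS` are hypothetical names introduced for this example. The helper is templated on the status type so that the same code can wrap both runtime API calls (success value `cudaSuccess`) and cuSOLVER calls (success value `CUSOLVER_STATUS_SUCCESS`).

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA): throws if `status` differs from
// the value the library uses to signal success.
template <typename Status>
void checkStatus(Status status, Status success,
                 const char* call, const char* file, int line)
{
    if (status != success) {
        std::ostringstream msg;
        msg << file << ":" << line << ": " << call
            << " returned status " << static_cast<long long>(status);
        throw std::runtime_error(msg.str());
    }
}

// Hypothetical macro: records the call text and source location automatically.
#define CHECK_STATUS(call, success) \
    checkStatus((call), (success), #call, __FILE__, __LINE__)

// Intended use (requires the CUDA and cuSOLVER headers):
//   CHECK_STATUS(cudaMalloc((void**)&devInfo, sizeof(int)), cudaSuccess);
//   CHECK_STATUS(cusolverDnCreate(&solverHandle), CUSOLVER_STATUS_SUCCESS);
```

Wrapping every API call this way makes the first failing call report itself with its file and line, instead of surfacing later as an unrelated failure.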
With proper error checking, you may have discovered:
This is not correct:
cudaMalloc((void **)&devInfo, 1);
The second parameter is the size in bytes, so it should be sizeof(int), not 1. This error results in an error in a cudaMemcpyAsync operation internal to the cusolverDnDgeqrf call, which would show up in cuda-memcheck output.
This is not correct:
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
The order of the pointer parameters is destination first, followed by source. You have those parameters reversed, so this call would return a runtime API error that you could observe if you were doing proper error checking (it is also visible in cuda-memcheck output).
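The destination-first convention is the same one `std::memcpy` uses, so a host-only analogy may help. The function `readBackInfo` below is a hypothetical name for this sketch; it only mimics the copy on the host and does not touch the GPU.

```cpp
#include <cstring>

// Host-only analogy (hypothetical helper): std::memcpy takes
// (destination, source, byte count), the same pointer order that
// cudaMemcpy uses before its direction flag.
int readBackInfo(const int* source)
{
    int hostInfo = 0;
    std::memcpy(&hostInfo, source, sizeof(int));  // destination first, then source
    return hostInfo;
}

// With the CUDA runtime, the corresponding device-to-host copy would be:
//   cudaMemcpy(&hDevInfo, devInfo, sizeof(int), cudaMemcpyDeviceToHost);
```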
Once you fix those errors, the cusolverDnDgeqrf call will actually return a zero status (no error). But we're not quite done yet; again, proper error checking would tell us so.
In addition to the above errors, you have made some sizing errors. Your matrix is of size N*P, so it has N*P elements, and you are initializing that many elements here:
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
but you are not allocating for that many elements on the host here:
cudaMallocHost((void **) &hostData, N * sizeof(double));
or on the device here:
cudaMalloc((void**)&devData, N * sizeof(double));
and you are not transferring that many elements here:
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
So in the 3 cases above, if you change N*sizeof(double) to N*P*sizeof(double), you will be able to fix those errors, and the code then runs with no errors reported by cuda-memcheck, and also no errors returned from any of the API calls.
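One way to avoid this class of mistake is to compute every byte count from an element count in a single place. The following is a minimal sketch; `bytesFor` and `matrixBytes` are hypothetical helpers, and it assumes the matrix is stored densely in column-major order with leading dimension N, as in the question.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes needed for `count` elements of type T.
template <typename T>
constexpr std::size_t bytesFor(std::size_t count)
{
    return count * sizeof(T);
}

// Bytes for a dense column-major matrix of doubles with `rows` rows and
// `cols` columns (assumes the leading dimension equals `rows`).
constexpr std::size_t matrixBytes(std::size_t rows, std::size_t cols)
{
    return bytesFor<double>(rows * cols);
}

// Intended use with the CUDA runtime (requires its headers):
//   cudaMalloc((void**)&devData, matrixBytes(N, P));
//   cudaMalloc((void**)&devInfo, bytesFor<int>(1));
```

With N = 5 and P = 3, `matrixBytes(N, P)` covers all 15 elements, whereas the original N * sizeof(double) covers only the first column.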
Upvotes: 3