Reputation: 109
I have written a wrapper in C++/CLI with Visual Studio to use CUDA's cuBLAS. I am using CUDA Toolkit 7.0.
Here is the source code of my wrapper:
#pragma once
#include "stdafx.h"
#include "BLAS.h"
#include "cuBLAS.h"
namespace lab
{
namespace Mathematics
{
namespace CUDA
{
void BLAS::DAXPY(int n, double alpha, const array<double> ^x, int incx, array<double> ^y, int incy)
{
pin_ptr<double> xPtr = &(x[0]);
pin_ptr<double> yPtr = &(y[0]);
pin_ptr<double> alphaPtr = &alpha;
cuBLAS::DAXPY(n, alphaPtr, xPtr, incx, yPtr, incy);
}
}
}
}
To test this code, I wrote the following test in C#:
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using lab.Mathematics.CUDA;
namespace lab.Mathematics.CUDA.Test
{
[TestClass]
public class TestBLAS
{
[TestMethod]
public void TestDAXPY()
{
var count = 10;
var alpha = 1.0;
var a = Enumerable.Range(0, count).Select(x => Convert.ToDouble(x)).ToArray();
var b = Enumerable.Range(0, count).Select(x => Convert.ToDouble(x)).ToArray();
// Call CUDA
BLAS.DAXPY(count, alpha, a, 1, b, 1);
// Validate results
for (int i = 0; i < count; i++)
{
Assert.AreEqual(i + i, b[i]);
}
}
}
}
The program compiles for the x64 architecture with no errors, but the results differ every time I run the test. More precisely, the array b holds the result, and it has different values on every run. I don't know why.
I am also adding my CUDA code in case someone can spot a problem there. Note that I don't get any errors or warnings while compiling. I also wonder whether I need to change some compilation settings, since I left everything at the default options.
void cuBLAS::DAXPY(int n, const double *alpha, const double *x, int incx, double *y, int incy)
{
cudaError_t cudaStat;
cublasStatus_t stat;
// Allocate GPU memory
double *devX, *devY;
cudaStat = cudaMalloc((void **)&devX, (size_t)n*sizeof(*devX));
if (cudaStat != cudaSuccess) {
// throw exception
std::ostringstream msg;
msg << "device memory allocation failed: fail.Stat = " << cudaStat;
throw new std::exception(msg.str().c_str());
}
cudaMalloc((void **)&devY, (size_t)n*sizeof(*devY));
// Create cuBLAS handle
cublasHandle_t handle;
cublasCreate(&handle);
// Initialize the input matrix and vector
cublasSetVector(n, sizeof(*devX), x, incx, devX, incx);
cublasSetVector(n, sizeof(*devY), y, incy, devY, incy);
// Call cuBLAS function
cublasDaxpy(handle, n, alpha, devX, incx, devY, incy);
// Retrieve resulting vector
cublasGetVector(n, sizeof(*devY), devY, incy, y, incy);
// Free GPU resources
cudaFree(devX);
cudaFree(devY);
cublasDestroy(handle);
}
EDIT: I added the change suggested by David Yaw and also added error checks for all CUDA operations. I left most of the error checks out here for readability. It is still not working.
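For reference, the error checks could follow a pattern like the one below. This is only a sketch, not the exact code from the project: check_status is a hypothetical helper name, and it is written as a template so the same function can take a cudaError_t (compared against cudaSuccess) or a cublasStatus_t (compared against CUBLAS_STATUS_SUCCESS). It uses no CUDA headers itself, so it compiles on its own.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA or cuBLAS): throws when `status`
// differs from `success`. In the wrapper, Status would be cudaError_t or
// cublasStatus_t; here it is left generic so the sketch has no CUDA dependency.
template <typename Status>
void check_status(Status status, Status success, const std::string& what)
{
    if (status != success) {
        std::ostringstream msg;
        msg << what << " failed with status " << static_cast<int>(status);
        // Thrown by value so callers can catch it by const reference.
        throw std::runtime_error(msg.str());
    }
}
```

In the wrapper this would be used along the lines of check_status(cudaMalloc((void **)&devX, n * sizeof(*devX)), cudaSuccess, "cudaMalloc") and check_status(cublasCreate(&handle), CUBLAS_STATUS_SUCCESS, "cublasCreate"), so every call gets checked without cluttering the function body.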
Upvotes: 1
Views: 355
Reputation: 109
So the code above is perfectly fine. The only problem was that I didn't compile it properly. According to this tutorial, every time you change your CUDA program (specifically the .cu file), you have to REBUILD the whole project so that Parallel Nsight compiles it again. Otherwise it sticks with the last compilation.
It is a tiny point, but it might save a lot of people a whole day of debugging and getting nowhere.
Upvotes: 0
Reputation: 27864
Your error is in these lines.
// Initialize the input matrix and vector
cublasSetVector(n, sizeof(*devX), x, incx, devX, incx);
// Call cuBLAS function
cublasDaxpy(handle, n, alpha, devX, incx, devY, incy);
// Retrieve resulting vector
cublasGetVector(n, sizeof(*devY), devY, incy, y, incy);
Quoting the documentation (emphasis mine):
This function multiplies the vector x by the scalar α and adds it to the vector y overwriting the latest vector with the result.
Y is both an input and an output, but you're never setting the value, so you get whatever junk is in the uninitialized memory. Add a call to cublasSetVector to set the initial value of devY before you call cublasDaxpy.
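To make the "input and output" point concrete, here is a minimal CPU sketch of what DAXPY computes. It is not cuBLAS code: daxpy_ref and expected_daxpy are hypothetical names made up for illustration, and the sketch assumes positive strides only. It can also serve as a reference to compare the GPU result against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for DAXPY: y[i*incy] = alpha * x[i*incx] + y[i*incy].
// Assumption: positive strides only; the BLAS convention for negative
// strides is not handled in this sketch.
void daxpy_ref(int n, double alpha, const std::vector<double>& x, int incx,
               std::vector<double>& y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i) {
        const std::size_t xi = static_cast<std::size_t>(i) * incx;
        const std::size_t yi = static_cast<std::size_t>(i) * incy;
        // y is read before it is written, so its old contents matter.
        y[yi] = alpha * x[xi] + y[yi];
    }
}

// Builds the same inputs as the C# test (0, 1, ..., count-1 for both
// vectors) and returns y after the update.
std::vector<double> expected_daxpy(int count, double alpha)
{
    std::vector<double> x(count), y(count);
    for (int i = 0; i < count; ++i) {
        x[i] = i;
        y[i] = i;
    }
    daxpy_ref(count, alpha, x, 1, y, 1);
    return y;
}
```

Because the old value of y appears on the right-hand side, the device copy of y has to be uploaded before the call, exactly like x. If devY is only allocated with cudaMalloc and never filled, that term is whatever happened to be in device memory, which would explain values that change from run to run.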
Upvotes: 2