Reputation: 529
I have the following MATLAB code:
[N, d] = size(X); % data size and dimensions
R = rand(d,dt); % Form a random matrix with elements in [0,1]
% Random projection
Y = X * R;
w=720; % hashing step
b = w * rand(dt,1);
% Compute the hash codes of the data
binId = floor( bsxfun(@plus, Y, b') / w);
and I tried to parallelize it using cuBLAS and a kernel, as follows:
__global__ void compute(const int N, const int dt, const int w, const float *old, int *newt) {
    // One thread per element of the N x dt projection matrix (column-major)
    int col = blockDim.y * blockIdx.y + threadIdx.y;
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    int id = row + N * col;
    if (row < N && col < dt) {
        // Hashing step: floor of the (already bias-shifted) projection divided by w
        newt[id] = (int)floorf(old[id] / w);
    }
}
void gpu_blas_mmul(cublasHandle_t handle, const float *A, const float *B, float *C,
                   const int m, const int k, const int n, const float bet) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1.0f;
    const float *alpha = &alf;
    const float *beta = &bet;
    // C = alpha * A * B + beta * C (column-major, no transposition)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
float *d_R, *d_RX, *d_B_row;
int *d_H;
thrust::device_vector<float> d_X(h_X, h_X + N * d);
cudaMalloc(&d_R, d * dt * sizeof(float));
cudaMemcpy(d_R, h_R, d * dt * sizeof(float), cudaMemcpyHostToDevice);
cudaMalloc(&d_B_row, dt * sizeof(float));
cudaMemcpy(d_B_row, h_B_row, dt * sizeof(float), cudaMemcpyHostToDevice);
cudaMalloc(&d_RX, N * dt * sizeof(float));
cudaMalloc(&d_H, N * dt * sizeof(int));
//------------------------- cuBLAS -----------------------
cublasHandle_t handle;
cublasCreate(&handle);
thrust::device_vector<float> d_B_col(N, w);
// d_RX = d_B_col * d_B_row (rank-1 product, N x dt)
gpu_blas_mmul(handle, thrust::raw_pointer_cast(&d_B_col[0]), d_B_row, d_RX, N, 1, dt, 0.0);
// d_RX = d_X * d_R + d_RX
gpu_blas_mmul(handle, thrust::raw_pointer_cast(&d_X[0]), d_R, d_RX, N, d, dt, 1.0);
cublasDestroy(handle);
//----------------------- Kernel ----------------------------
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE, 1);
int linGrid1 = (int)ceil(N / (float)BLOCK_SIZE);
int linGrid2 = (int)ceil(dt / (float)BLOCK_SIZE);
dim3 gridSize(linGrid1, linGrid2, 1);
compute<<<gridSize, blockSize>>>(N, dt, w, d_RX, d_H);
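(Not shown above: to compare against MATLAB's binId, d_H has to be copied back to the host. A minimal sketch of that step, where h_H is just an ordinary host buffer I introduce here for illustration, plus a basic launch-error check:)
// Sketch: check the launch and copy the hash codes back for comparison with binId
int *h_H = (int*)malloc(N * dt * sizeof(int));
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("compute launch failed: %s\n", cudaGetErrorString(err));
cudaMemcpy(h_H, d_H, N * dt * sizeof(int), cudaMemcpyDeviceToHost);  // synchronizes with the kernel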
In h_X, h_R and h_B_row I have saved (in column-major order) X, R and b produced by MATLAB. The dataset I am using is ANN_SIFT1M from http://corpus-texmex.irisa.fr/
For about 10000 values the results are exactly the same, but when I try with 50000 values, for example, there are some differences, and they become more frequent as the number of values increases.
Any idea what I am doing wrong?
Upvotes: 2
Views: 184
Reputation: 5697
Your MATLAB code uses double precision, so its result is more accurate. In contrast, the CUDA code you provided uses single precision (type float) and therefore produces a less accurate result. As is usual with single- vs. double-precision issues, the problem only gets worse as you increase the size of your input data.
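For intuition, here is a tiny, self-contained illustration (the values are contrived, not taken from your data) of how a single rounding error in float is enough to push a sum across a bin boundary before the floor:
#include <math.h>
#include <stdio.h>

int main(void) {
    // Contrived values: y + b sits just below a multiple of w = 720
    double y = 719999.5, b = 0.499;            // y + b = 719999.999
    float  yf = (float)y, bf = (float)b;       // single-precision copies

    double binD = floor((y + b) / 720.0);      // 999
    float  binF = floorf((yf + bf) / 720.0f);  // 1000: yf + bf rounds up to 720000
    printf("double bin = %.0f, float bin = %.0f\n", binD, binF);
    return 0;
}
In your case the rounding happens gradually inside the single-precision GEMM, but the effect on the hash code is the same.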
The solution would be to use type double instead of float.
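A minimal sketch of that change, assuming h_X, h_R and h_B_row are also kept as double on the host and the device buffers are allocated with sizeof(double); only the element type and the GEMM routine change, the rest of the code stays the same:
__global__ void compute(const int N, const int dt, const int w, const double *old, int *newt) {
    int col = blockDim.y * blockIdx.y + threadIdx.y;
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    int id = row + N * col;
    if (row < N && col < dt) {
        newt[id] = (int)floor(old[id] / w);   // same hashing step, now in double precision
    }
}

void gpu_blas_mmul(cublasHandle_t handle, const double *A, const double *B, double *C,
                   const int m, const int k, const int n, const double bet) {
    int lda = m, ldb = k, ldc = m;
    const double alf = 1.0;
    // Double-precision GEMM: C = alf * A * B + bet * C
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alf, A, lda, B, ldb, &bet, C, ldc);
}
Keep in mind that double-precision throughput is much lower than single-precision on most GeForce-class GPUs, so this trades speed for accuracy.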
Upvotes: 4