Reputation: 181
I'm trying to do an exclusive sum reduction in CUDA. I am using the CUB library and have decided to try the CUB::DeviceReduce. However, my result is NaN, and I can't figure out why.
Code is:
#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
using std::cout;
using std::endl;
#define DSIZE 512
void dev_cumsum( const float *dev_inData, float *dev_outData ) {
int n = 512;
void* dev_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(dev_temp_storage,temp_storage_bytes,const_cast<float*>(dev_inData),dev_outData,n);
cudaMalloc(&dev_temp_storage,temp_storage_bytes);
cub::DeviceScan::ExclusiveSum(dev_temp_storage,temp_storage_bytes,const_cast<float*>(dev_inData),dev_outData,n);
}
int main(){
float h_data[512];
float* d_data;
float* d_result;
float h_result[512];
cudaMalloc(&d_data, DSIZE*sizeof(float));
cudaMalloc(&d_result, DSIZE*sizeof(float));
h_data[0] = rand()%10;
h_result[0] = 0;
for (int i=1; i<DSIZE; i++) {
h_data[i] = rand()%10;
h_result[i] = h_data[i-1]+h_result[i-1];
}
cudaMemcpy(d_data, h_data, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
dev_cumsum(d_data, d_result);
printf("CPU result = %f\n", h_result[511]);
cudaMemcpy(h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
printf("GPU result = %f\n", h_result[511]);
for( int i = 0; i < DSIZE; i++ ) {cout << h_result[i] << " ";}
cout << endl;
return 0;
}
This code gives me NaN for the last 8 elements of the device result.
This code is running on a GTX650 Ti Boost in Linux Mint15. I'm using NSight and the console output compile command is:
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -G -g -O0 -gencode arch=compute_30,code=sm_30 -odir "" -M -o "main.d" "../main.cu"
/usr/local/cuda-5.5/bin/nvcc --device-c -G -O0 -g -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30 -x cu -o "main.o" "../main.cu"
Cuda version is 5.5 CUB version 1.0.2
This was tested on another computer with Cuda 6, OSX10.9.2, CUB 1.2.3 and running a GT750M, and reproduced the error of last 8 numbers being NaN
edit: The code works correctly with int and double, but not float.
edit: With thanks to Robert Crovella, this question was originally asked in regards to DeviceReduce. That code worked, it was throwing NaN because earlier code using DeviceScan was feeding it NaN as input. Question is revised to suit
Upvotes: 2
Views: 1476
Reputation: 152073
EDIT: cub 1.3.0 was recently released, and I believe it includes a fix for this issue.
There's a few changes I would make to your code, that I consider to be errors, but I don't know if they are affecting what you are seeing. In the following code section, you are using h_result[0]
without initializing it, so add the line I have marked with a comment:
h_data[0] = rand()%10;
h_result[0] = 0; // ADD THIS LINE
for (int i=1; i<DSIZE; i++) {
h_data[i] = rand()%10;
h_result[i] = h_data[i-1]+h_result[i-1];
}
(Clearly that one should not be influencing your GPU result.) Also, your final cudaMemcpy
operation is not quite right:
cudaMemcpy(&h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
^
delete this ampersand
Since h_result
is already a pointer in your formulation, we don't need to pass the address of it to cudaMemcpy
.
Can you try making those changes and see what kind of results you get?
I've been struggling a bit with this. If you can still reproduce the error, I'd appreciate it if you can:
Upvotes: 1
Reputation: 11
When I ran the code, I found that it is not the last 8 values that are set to NaN but it is, in fact, all of the values since the last integer-multiple of 72 that are set to NaN. In your example there are 512 values: this means that the first 504 (7 * 72) were correct and the following 8 values were NaN.
This behaviour seems to continues until 568 (8 * 72) values and thereafter it seems to work correctly.
The code that I used to test this is here: http://pastebin.com/kXVvuKAN
I compiled the code with the following command:
nvcc --relocatable-device-code=true -gencode arch=compute_30,code=compute_30 -G -o main main.cu
NOTE: If I didn't use the -G parameter, the results were more random. However, with the -G command, it gave the clear pattern mentioned above.
Upvotes: 1