user2462730

Reputation: 181

Using CUB::DeviceScan

I'm trying to do an exclusive prefix sum (scan) in CUDA. I am using the CUB library and have decided to try cub::DeviceScan::ExclusiveSum. However, my result contains NaN values, and I can't figure out why.

Code is:

#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
using std::cout;
using std::endl;
#define DSIZE 512

void dev_cumsum( const float *dev_inData, float *dev_outData ) {
    int n = 512;
    void* dev_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    // First call: with dev_temp_storage == NULL, CUB only computes the required
    // temp_storage_bytes and performs no scan.
    cub::DeviceScan::ExclusiveSum(dev_temp_storage, temp_storage_bytes, const_cast<float*>(dev_inData), dev_outData, n);
    cudaMalloc(&dev_temp_storage, temp_storage_bytes);
    // Second call: performs the actual exclusive prefix sum.
    cub::DeviceScan::ExclusiveSum(dev_temp_storage, temp_storage_bytes, const_cast<float*>(dev_inData), dev_outData, n);
}

int main(){
    float h_data[512];
    float* d_data;
    float* d_result;
    float h_result[512];
    cudaMalloc(&d_data, DSIZE*sizeof(float));
    cudaMalloc(&d_result, DSIZE*sizeof(float));
    h_data[0] = rand()%10;
    h_result[0] = 0;
    for (int i=1; i<DSIZE; i++) {
        h_data[i] = rand()%10;
        h_result[i] = h_data[i-1]+h_result[i-1];
    }
    cudaMemcpy(d_data, h_data, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    dev_cumsum(d_data, d_result);
    printf("CPU result = %f\n", h_result[511]);
    cudaMemcpy(h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU result = %f\n", h_result[511]);
    for( int i = 0; i < DSIZE; i++ ) {cout << h_result[i] << " ";}
    cout << endl;
    return 0;
}

This code gives me NaN for the last 8 elements of the device result.

This code is running on a GTX 650 Ti Boost under Linux Mint 15. I'm using Nsight, and the compile command from the console output is:

Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -G -g -O0 -gencode arch=compute_30,code=sm_30 -odir "" -M -o "main.d" "../main.cu"
/usr/local/cuda-5.5/bin/nvcc --device-c -G -O0 -g -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30  -x cu -o  "main.o" "../main.cu"

CUDA version is 5.5, CUB version is 1.0.2.

This was also tested on another computer with CUDA 6, OS X 10.9.2, and CUB 1.2.3, running a GT 750M, and the error of the last 8 numbers being NaN was reproduced.

Edit: The code works correctly with int and double, but not with float.
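A minimal sketch of how the element type can be swapped for such a test, assuming the same two-phase ExclusiveSum pattern as above (the templated wrapper is illustrative only, not part of the original code):

// Illustrative only: a templated version of dev_cumsum so the element type
// can be switched between float, double, and int for comparison.
template <typename T>
void dev_cumsum_t( const T *dev_inData, T *dev_outData, int n ) {
    void* dev_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    // First call sizes the temporary storage, second call runs the scan.
    cub::DeviceScan::ExclusiveSum(dev_temp_storage, temp_storage_bytes, const_cast<T*>(dev_inData), dev_outData, n);
    cudaMalloc(&dev_temp_storage, temp_storage_bytes);
    cub::DeviceScan::ExclusiveSum(dev_temp_storage, temp_storage_bytes, const_cast<T*>(dev_inData), dev_outData, n);
    cudaFree(dev_temp_storage);
}

// e.g. dev_cumsum_t<double>(d_data_dbl, d_result_dbl, DSIZE);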

Edit: With thanks to Robert Crovella: this question was originally asked about DeviceReduce. That code worked; it was returning NaN because earlier code using DeviceScan was feeding it NaN as input. The question has been revised accordingly.

Upvotes: 2

Views: 1476

Answers (2)

Robert Crovella

Reputation: 152073

EDIT: CUB 1.3.0 was recently released, and I believe it includes a fix for this issue.


There are a few changes I would make to your code that I consider to be errors, though I don't know if they are affecting what you are seeing. In the following code section, you are using h_result[0] without initializing it, so add the line I have marked with a comment:

h_data[0] = rand()%10;
h_result[0] = 0;    // ADD THIS LINE
for (int i=1; i<DSIZE; i++) {
    h_data[i] = rand()%10;
    h_result[i] = h_data[i-1]+h_result[i-1];
}

(Clearly that one should not be influencing your GPU result.) Also, your final cudaMemcpy operation is not quite right:

cudaMemcpy(&h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
           ^
           delete this ampersand

Since h_result is an array, it already decays to a pointer when passed to cudaMemcpy, so we don't need to take its address.
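Putting both changes together, the relevant parts of main would look like this (everything else unchanged):

h_data[0] = rand()%10;
h_result[0] = 0;    // initialize the first CPU result
for (int i=1; i<DSIZE; i++) {
    h_data[i] = rand()%10;
    h_result[i] = h_data[i-1]+h_result[i-1];
}
// ... scan and first printf as before ...
cudaMemcpy(h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);  // no & on h_result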

Can you try making those changes and see what kind of results you get?

I've been struggling a bit with this. If you can still reproduce the error, I'd appreciate it if you could:

  1. reboot your machine and try again
  2. respond with the actual updated code you are running, the compile command you are using, the CUDA version, the CUB version, the GPU you are running on, and the system OS (edit your original question with this info)

Upvotes: 1

user3587566

Reputation: 11

When I ran the code, I found that it is not specifically the last 8 values that are set to NaN; rather, every value after the last integer multiple of 72 is set to NaN. In your example there are 512 values, so the first 504 (7 * 72) were correct and the remaining 8 values were NaN.

This behaviour seems to continue up to 576 (8 * 72) values, and thereafter it seems to work correctly.

The code that I used to test this is here: http://pastebin.com/kXVvuKAN
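In case that link goes stale, here is a minimal sketch of that kind of size sweep (a reconstruction for illustration, using the same two-phase ExclusiveSum pattern as the question; it is not the exact pastebin code):

#include <cub/cub.cuh>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    // Sweep input sizes around 512 and count how many outputs come back NaN.
    for (int n = 500; n <= 600; n++) {
        std::vector<float> h_in(n), h_out(n);
        for (int i = 0; i < n; i++) h_in[i] = rand() % 10;

        float *d_in = NULL, *d_out = NULL;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        void* d_temp = NULL;
        size_t temp_bytes = 0;
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n);  // size query
        cudaMalloc(&d_temp, temp_bytes);
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n);  // actual scan

        cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        int nan_count = 0;
        for (int i = 0; i < n; i++) if (std::isnan(h_out[i])) nan_count++;
        printf("n = %d: %d NaN values\n", n, nan_count);

        cudaFree(d_in); cudaFree(d_out); cudaFree(d_temp);
    }
    return 0;
}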

I compiled the code with the following command:

nvcc --relocatable-device-code=true -gencode arch=compute_30,code=compute_30 -G -o main main.cu

NOTE: If I didn't use the -G flag, the results were more random. However, with -G, the clear pattern mentioned above appeared.

Upvotes: 1
