Anugerah Erlaut

Reputation: 1130

Calling CUDA printf with a float from a kernel returns garbage?

I'm trying to print float values from a kernel using printf. I'm doing this to debug another program I'm working on, which copies float arrays from host to device. I wrote a kernel to inspect the values stored in the device-side float array, but it printed 0 for every element.

So I wrote this code to check:

#include <stdio.h>
#include <stdlib.h>  // malloc

#define ARR_LENGTH 3

__global__ void checkArr(float* arr);

int main(void)  
{  
    float* arr = (float*) malloc(sizeof(float) * ARR_LENGTH);

    float cont = 0;
    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        arr[i] = cont;
        cont++;
    }

    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , arr[i]);
    }


    float* d_arr;
    cudaMalloc((void**) &d_arr, sizeof(float) * ARR_LENGTH);
    cudaMemcpy(d_arr, arr, sizeof(float) * ARR_LENGTH, cudaMemcpyHostToDevice);

    printf("got here\n");

    checkArr<<<1,1>>>(d_arr);

    printf("got here\n");

    float* check = (float*) malloc(sizeof(float) * ARR_LENGTH);
    cudaMemcpy(check, d_arr, sizeof(float) * ARR_LENGTH, cudaMemcpyDeviceToHost);   
    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , check[i]);
    }

}

__global__ void checkArr(float* arr ) 
{
    float check = 5.0;
    printf("float check : %f\n", check);
    printf("float check : %f\n", check + 1.0);
    printf("float check : %f\n", check + 2.0);

    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , arr[i]);
    }
}

It produces this output:

arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
got here
float check : 0
float check : 0
float check : 0
arr[0] : 2.4375
arr[1] : 2.4375
arr[2] : 2.4375
got here
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000

If I don't print the 'float check' lines before printing the array, the array values come back as 0. It's kind of weird. Is there an explanation? Does this mean I can't inspect float values in device memory? (As you can see, the int index seems to print fine.)

I compile the program with -arch=sm_20. Since I don't have a CUDA-capable device at home, I compiled and ran the check using GPUOcelot. Can you reproduce this error on a real device?

Cheers, AErlaut

Upvotes: 0

Views: 2384

Answers (1)

Robert Crovella

Reputation: 152173

When I compile and run your code on an actual sm_20 GPU (M2090), I get the following output:

$ ./t97
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
got here
got here
float check : 5.000000
float check : 6.000000
float check : 7.000000
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
$

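Note that printf output from a kernel on a real device is somewhat asynchronous with respect to printf calls made on the host, so the two can appear in a different order than the source code suggests. That is why both "got here" lines show up before the kernel's output above.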
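
If you want the kernel's output to show up at a predictable point, one option is to synchronize right after the launch. The following is only a minimal sketch, not a fix for the Ocelot behavior: CUDA_CHECK is a hypothetical helper macro I'm introducing here (it is not part of your code or of the CUDA runtime), and the host array h_arr and the "before launch" / "after kernel" messages are my own names. Device-side printf output is flushed at synchronization points such as cudaDeviceSynchronize() or a blocking cudaMemcpy(), so with the explicit sync the device lines should be emitted before the host continues.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define ARR_LENGTH 3

// Hypothetical helper (not in the original post): abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed at %s:%d: %s\n", #call,        \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Prints each element of the device array from a single thread.
__global__ void checkArr(const float* arr)
{
    for (int i = 0; i < ARR_LENGTH; i++) {
        printf("device arr[%d] : %f\n", i, arr[i]);
    }
}

int main(void)
{
    float h_arr[ARR_LENGTH] = {0.0f, 1.0f, 2.0f};
    float* d_arr = NULL;

    CUDA_CHECK(cudaMalloc((void**)&d_arr, sizeof(h_arr)));
    CUDA_CHECK(cudaMemcpy(d_arr, h_arr, sizeof(h_arr), cudaMemcpyHostToDevice));

    printf("before launch\n");
    checkArr<<<1, 1>>>(d_arr);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // waits for the kernel and flushes device printf
    printf("after kernel\n");

    CUDA_CHECK(cudaFree(d_arr));
    return 0;
}
```

Checking the return codes also helps when comparing an emulator against real hardware, since a failed copy or launch would otherwise go unnoticed and could look like garbage values.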

My point is that GPU behavior and Ocelot behavior may differ. If you continue to post "please check my Ocelot programs on a real GPU for me" questions, I won't respond to those.

Upvotes: 1
