Pouya BCD

Reputation: 1031

How to return a single variable from a CUDA kernel function?

I have a CUDA search function that computes a single variable. How can I return it to the host?

__global__ 
void G_SearchByNameID(node* Node, long nodeCount, long start,char* dest, long answer){
    answer = 2;
}

cudaMemcpy(h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
cudaFree(d_answer);

for both of these lines I get this error: error: argument of type "long" is incompatible with parameter of type "const void *"

Upvotes: 25

Views: 34746

Answers (3)

Eyal

Reputation: 5848

With modern CUDA, in addition to the two answers above, you could also use managed memory or pinned host memory.

Managed memory is memory that is automatically migrated back and forth between CPU and GPU memory. It's better suited to large data: the first access causes a page fault that triggers the copy, and subsequent accesses are fast. However, the mechanism has overhead, so it might be quite slow for just a single word. You can also use CUDA prefetching (cudaMemPrefetchAsync) to force the memory onto the CPU or the GPU ahead of time if you can predict where it will be needed. This is also useful when you don't have enough GPU memory to keep everything resident at once.
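For the single-value case here, the managed-memory version looks something like this (a sketch, assuming CUDA 6 or later and a device that supports unified memory; the kernel name is made up):

```cuda
#include <cstdio>

// Illustrative kernel: writes the single result through a managed pointer.
__global__ void g_searchManaged(long* answer) { *answer = 2; }

int main() {
    long* answer;
    cudaMallocManaged(&answer, sizeof(long)); // one pointer, valid on CPU and GPU
    g_searchManaged<<<1, 1>>>(answer);
    cudaDeviceSynchronize();                  // wait for the kernel before the CPU reads
    printf("answer: %ld\n", *answer);         // no explicit memcpy needed
    cudaFree(answer);
    return 0;
}
```

Note there is no cudaMemcpy at all; the runtime migrates the page on demand, which is exactly the overhead mentioned above.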

With cudaMallocHost, you are allocating pinned memory. That means it is never swapped out of CPU memory to disk, and it is always accessible from both the GPU and the CPU. It's easy to use, but reads from the GPU are slow because every access makes the round trip to CPU memory and back; writes are faster. This technique takes less code than allocating GPU memory and doing a memcpy, but the speed is about the same.

Also remember that, for the pinned-memory and memcpy solutions, you need to synchronize, or else you might try to memcpy or read the result before the kernel has completed. In the other answers, the cudaMemcpy runs on the default stream (no stream specified), so it is implicitly ordered after the kernel anyway. If you use the pinned-memory solution, you don't need a memcpy, but you do need to synchronize explicitly.
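A sketch of the pinned-memory version, including the explicit synchronization (this assumes a 64-bit system with unified addressing, where a cudaMallocHost pointer can be dereferenced directly in a kernel; the kernel name is made up):

```cuda
#include <cstdio>

// Illustrative kernel: writes the result straight into pinned host memory.
__global__ void g_searchPinned(long* answer) { *answer = 2; }

int main() {
    long* answer;
    cudaMallocHost(&answer, sizeof(long)); // page-locked host allocation
    g_searchPinned<<<1, 1>>>(answer);      // GPU writes across the bus
    cudaDeviceSynchronize();               // no memcpy, but we MUST sync before reading
    printf("answer: %ld\n", *answer);
    cudaFreeHost(answer);
    return 0;
}
```

Without the cudaDeviceSynchronize, the printf could run before the kernel has written the value.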

Upvotes: 1

wich

Reputation: 17157

I've been using __device__ variables for this purpose, that way you don't have to bother with cudaMalloc and cudaFree and you don't have to pass a pointer as a kernel argument, which saves you a register in your kernel to boot.

#include <cstdio>

__device__ long d_answer;

__global__ void G_SearchByNameID() {
  d_answer = 2;
}

int main() {
  G_SearchByNameID<<<1,1>>>();
  long answer;
  cudaMemcpyFromSymbol(&answer, d_answer, sizeof(answer), 0, cudaMemcpyDeviceToHost);
  printf("answer: %ld\n", answer);
  return 0;
}

Note that the symbol is passed directly to cudaMemcpyFromSymbol; passing its name as a string (e.g. "d_answer") was accepted by old CUDA toolkits but was removed in CUDA 5.

Upvotes: 40

fabmilo

Reputation: 48330

To get a single result back, you have to copy it with cudaMemcpy, i.e.:

#include <assert.h>

__global__ void g_singleAnswer(long* answer){ *answer = 2; }

int main(){

  long h_answer;
  long* d_answer;
  cudaMalloc(&d_answer, sizeof(long));
  g_singleAnswer<<<1,1>>>(d_answer);
  cudaMemcpy(&h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost); 
  cudaFree(d_answer);
  assert(h_answer == 2);
  return 0;
}

I guess the error comes because you are passing a long value instead of a pointer to a long.

Upvotes: 28
