Reputation: 1031
I have a CUDA search function which computes a single value. How can I return it to the host?
__global__
void G_SearchByNameID(node* Node, long nodeCount, long start,char* dest, long answer){
answer = 2;
}
cudaMemcpy(h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
cudaFree(d_answer);
For both of these lines I get this error: error: argument of type "long" is incompatible with parameter of type "const void *"
Upvotes: 25
Views: 34746
Reputation: 5848
With modern CUDA, in addition to the two answers above, you could also use managed memory or pinned host memory.
Managed memory is memory that is automatically migrated back and forth between CPU and GPU memory. It's better for large data: the first access causes a page fault that triggers the copy, so subsequent accesses are fast. However, the mechanism has overhead, so it might be quite slow for just a single word. You can also use CUDA prefetching to force the memory onto CPU or GPU memory ahead of time if you can predict where it will be needed. This is also useful when you don't have enough GPU memory to keep everything resident on the GPU at once.
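As a sketch, the managed-memory version of the single-answer pattern could look like this (the kernel and variable names here are illustrative, not from the question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void g_singleAnswer(long* answer) { *answer = 2; }

int main() {
    long* answer;
    // Managed memory is visible to both host and device;
    // the runtime migrates the page on demand.
    cudaMallocManaged(&answer, sizeof(long));
    g_singleAnswer<<<1, 1>>>(answer);
    // The kernel launch is asynchronous, so wait before reading.
    cudaDeviceSynchronize();
    printf("answer: %ld\n", *answer);
    cudaFree(answer);
    return 0;
}
```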
With cudaMallocHost, you are allocating pinned memory. That means it is never swapped out of CPU memory to disk, and it's always accessible from both the GPU and the CPU. It's easy to use, but reads from the GPU are slow because of the round-trip from GPU to CPU memory and back; writes are faster. Using this technique takes less code than allocating GPU memory and doing a memcpy, and the speed is about the same.
Also remember that, for the pinned-memory and managed-memory solutions, you need to synchronize between the kernel launch and the host-side read, or else you might read the result before the kernel is complete. In the other answers, the cudaMemcpy is synchronizing (no stream specified), so it runs after the kernel anyway. If you use the pinned-memory solution, you don't need the memcpy, but you do need to synchronize.
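A pinned-memory sketch along the same lines, assuming a 64-bit platform with unified virtual addressing so the host pointer from cudaMallocHost can be dereferenced in the kernel directly (names again illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void g_singleAnswer(long* answer) { *answer = 2; }

int main() {
    long* answer;
    // Pinned (page-locked) host memory; with unified virtual
    // addressing the same pointer works on the device.
    cudaMallocHost(&answer, sizeof(long));
    g_singleAnswer<<<1, 1>>>(answer);
    // No cudaMemcpy here, so we must synchronize explicitly
    // before reading the result on the host.
    cudaDeviceSynchronize();
    printf("answer: %ld\n", *answer);
    cudaFreeHost(answer);
    return 0;
}
```

Every device-side access to this memory goes over the bus, which is why this approach is best reserved for small results like a single word.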
Upvotes: 1
Reputation: 17157
I've been using __device__ variables for this purpose. That way you don't have to bother with cudaMalloc and cudaFree, and you don't have to pass a pointer as a kernel argument, which saves you a register in your kernel to boot.
#include <cstdio>

__device__ long d_answer;

__global__ void G_SearchByNameID() {
    d_answer = 2;
}

int main() {
    G_SearchByNameID<<<1,1>>>();
    long answer;
    // Pass the symbol itself, not its name as a string;
    // string lookup was removed in CUDA 5.0.
    cudaMemcpyFromSymbol(&answer, d_answer, sizeof(answer), 0, cudaMemcpyDeviceToHost);
    printf("answer: %ld\n", answer);
    return 0;
}
Upvotes: 40
Reputation: 48330
To get a single result back you have to cudaMemcpy it, i.e.:
#include <assert.h>
__global__ void g_singleAnswer(long* answer){ *answer = 2; }
int main(){
long h_answer;
long* d_answer;
cudaMalloc(&d_answer, sizeof(long));
g_singleAnswer<<<1,1>>>(d_answer);
cudaMemcpy(&h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
cudaFree(d_answer);
assert(h_answer == 2);
return 0;
}
I guess the error comes from passing a long value instead of a pointer to a long value: your d_answer should be declared as long*, not long.
Upvotes: 28