Reputation: 137
This actually works, so is CUDA dynamically allocating memory on the device inside the thread? If so, what is the use of __device__ malloc, since this is much, much faster in comparison? I am asking what really goes on behind the scenes when you use cudaMalloc inside a kernel, since it seems so much faster than plain device malloc on the heap.
#include <iostream>
#include <numeric>
#include <stdlib.h>

__global__ void testMem(int* time){
    int* a;
    // device-side cudaMalloc (needs compute capability 3.5+, -rdc=true and cudadevrt)
    cudaMalloc(&a, sizeof(int));
    a[0] = 4;
    *time = a[0];   // write the value back through the output pointer
    cudaFree(a);
}

__global__ void testMem2(int* time){
}
int main(){
    int* h_time = (int*)malloc(sizeof(int));
    h_time[0] = 0;
    int* d_time;
    cudaMalloc(&d_time, sizeof(int));

    clock_t start1 = clock();
    cudaMemcpy(d_time, h_time, sizeof(int), cudaMemcpyHostToDevice);
    testMem<<<1,1>>>(d_time);
    cudaMemcpy(h_time, d_time, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    clock_t end1 = clock();

    int result = end1 - start1;
    //float result = (float)*h_time;
    //result = result / CLOCKS_PER_SEC;
    std::cout << result << std::endl;
    std::cout << *h_time << std::endl;
    //std::cout << (1<<10);

    cudaFree(d_time);
    free(h_time);
}
Upvotes: 4
Views: 3953
Reputation: 1
I found something here in the guide, though it's under the CUDA dynamic parallelism section. It says that cudaMalloc and cudaFree invoked by the device runtime are mapped to malloc and free, which operate on the device heap (8 MB by default, according to this).
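As a rough illustration of that mapping, the heap those device-side allocations come from can be queried and resized from the host with cudaDeviceGetLimit/cudaDeviceSetLimit. A minimal sketch, assuming the default 8 MB is too small for your in-kernel allocations (the 64 MB value is just an example):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    size_t heapSize = 0;

    // Query the default device-side malloc heap (8 MB by default).
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("default device heap: %zu bytes\n", heapSize);

    // Grow it; this must happen before any kernel that allocates on the device runs.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("new device heap: %zu bytes\n", heapSize);
    return 0;
}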
Upvotes: 0
Reputation: 2916
Starting with compute capability 3.5, you may use part of the CUDA runtime API within kernels. These methods are declared as __host__ __device__ in the documentation, for example:

__host__ __device__ cudaError_t cudaMalloc ( void** devPtr, size_t size )
    Allocate memory on the device.

When doing so, remember to compile with relocatable device code and to link against the device runtime library: cudadevrt.lib.
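A typical build line for that (file name and architecture are just examples) looks like:

nvcc -arch=sm_35 -rdc=true testMem.cu -lcudadevrt -o testMem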
There is another way to allocate memory dynamically on the device: in-kernel malloc, which is implemented differently (documented here). It uses a small memory heap and does not require the same compute capability; see the sketch below.
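For comparison, here is a minimal sketch of the question's kernel rewritten with in-kernel malloc (kernel and variable names are my own); it compiles with a plain nvcc invocation, no -rdc=true or cudadevrt needed:

#include <cstdio>

__global__ void mallocDemo(int* out){
    // in-kernel malloc draws from the device heap (8 MB by default)
    int* a = (int*)malloc(sizeof(int));
    if (a != NULL){
        a[0] = 4;
        *out = a[0];
        free(a);   // free on the device side as well
    }
}

int main(){
    int h_out = 0;
    int* d_out;
    cudaMalloc(&d_out, sizeof(int));
    mallocDemo<<<1,1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);
    cudaFree(d_out);
    return 0;
}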
Upvotes: 3