Reputation: 137
This actually works, so is CUDA dynamically allocating memory on the device inside the thread? If so, what is the use of __device__ malloc, since this is much, much faster in comparison? I am asking what really goes on behind the scenes when you use cudaMalloc inside a kernel, since it seems so much faster than plain device malloc on the heap.
#include <iostream>
#include <numeric>
#include <stdlib.h>

__global__ void testMem(int* time){
    int* a;
    // device-side cudaMalloc (needs compute capability 3.5+, -rdc=true and cudadevrt)
    cudaMalloc(&a, sizeof(int));
    a[0] = 4;
    *time = a[0];   // write the value back through the output pointer
    cudaFree(a);
}

__global__ void testMem2(int* time){
}
int main(){
    int* h_time = (int*)malloc(sizeof(int));
    h_time[0] = 0;
    int* d_time;
    cudaMalloc(&d_time, sizeof(int));

    clock_t start1 = clock();
    cudaMemcpy(d_time, h_time, sizeof(int), cudaMemcpyHostToDevice);
    testMem<<<1,1>>>(d_time);
    cudaMemcpy(h_time, d_time, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    clock_t end1 = clock();

    int result = end1 - start1;
    //float result = (float)*h_time;
    //result = result / CLOCKS_PER_SEC;
    std::cout << result << std::endl;
    std::cout << *h_time << std::endl;
    //std::cout << (1<<10);

    cudaFree(d_time);
    free(h_time);
}
Upvotes: 4
Views: 3953
Reputation: 1
I found something here in the guide, though it's under the CUDA dynamic parallelism section. It says that cudaMalloc and cudaFree invoked by the device runtime are mapped to malloc and free, which operate on the device heap (8 MB by default, according to this).
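As a rough illustration of that mapping, the heap those device-side allocations come from can be queried and resized from the host with cudaDeviceGetLimit/cudaDeviceSetLimit. A minimal sketch, assuming the default 8 MB is too small for your in-kernel allocations (the 64 MB value is just an example):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    size_t heapSize = 0;

    // Query the default device-side malloc heap (8 MB by default).
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("default device heap: %zu bytes\n", heapSize);

    // Grow it; this must happen before any kernel that allocates on the device runs.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("new device heap: %zu bytes\n", heapSize);
    return 0;
}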
Upvotes: 0
Reputation: 2916
Starting with compute capability 3.5, you may use part of the CUDA runtime API within kernels. These methods are declared as __host__ __device__ in the documentation, for example:

__host__ __device__ cudaError_t cudaMalloc ( void** devPtr, size_t size )
    Allocate memory on the device.

When doing so, remember to compile with relocatable device code and to link against the device runtime library: cudadevrt.lib.
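A typical build line for that (file name and architecture are just examples) looks like:

nvcc -arch=sm_35 -rdc=true testMem.cu -lcudadevrt -o testMem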
There is another way to allocate memory dynamically on the device: in-kernel malloc, which is implemented differently (documented here). It uses a small memory heap and does not require the same compute capability; see the sketch below.
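For comparison, here is a minimal sketch of the question's kernel rewritten with in-kernel malloc (kernel and variable names are my own); it compiles with a plain nvcc invocation, no -rdc=true or cudadevrt needed:

#include <cstdio>

__global__ void mallocDemo(int* out){
    // in-kernel malloc draws from the device heap (8 MB by default)
    int* a = (int*)malloc(sizeof(int));
    if (a != NULL){
        a[0] = 4;
        *out = a[0];
        free(a);   // free on the device side as well
    }
}

int main(){
    int h_out = 0;
    int* d_out;
    cudaMalloc(&d_out, sizeof(int));
    mallocDemo<<<1,1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);
    cudaFree(d_out);
    return 0;
}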
Upvotes: 3