scatman

Reputation: 14555

cudaMalloc failing After Several Hours

Why would cudaMalloc fail to allocate memory after GPU code has been running for 2-3 hours?
I am using Process Explorer to monitor global memory usage. cudaMalloc suddenly starts failing even though free global memory is still available on the GPU.

How can I find out the actual reason for this failure? At the moment I am doing this:

if ( cudaSuccess !=cudaMalloc((void **) &arr, sizeof(int)*100)) 
    printf("Cannot Allocate Mem");

Is there a better way to print the actual reason for the failure in CUDA?

Upvotes: 0

Views: 777

Answers (2)

Maciej Skorski

Reputation: 3344

Compare the return value of cudaMalloc against the status code cudaSuccess, and pass it to cudaGetErrorString to get a readable message. A minimal working example follows; remember to be environmentally friendly and free the memory!

// nvcc device_query.cu -o device_query; ./device_query

#include <stdio.h> 

int main() {
    int *arr = NULL; // initialized so that cudaFree is a no-op if the allocation fails
    cudaError_t err= cudaMalloc((void **) &arr, sizeof(int)*1024*1024*1024*10);
    if(err != cudaSuccess){
        printf("The error is %s\n", cudaGetErrorString(err));
    }
    cudaFree( arr );
}

Because the allocation is intentionally excessive (40 GiB with a 4-byte int), this prints:

root@38c6fcde90a4:/home/zkp/cuZK/test# nvcc device_query.cu -o device_query; ./device_query
The error is out of memory

This example is essentially a recipe from the great book "CUDA by Example", whose example code is available on GitHub.


Even better, include the book's helper header in your code and wrap each call in its HANDLE_ERROR macro. The same example then becomes:

// nvcc device_query.cu -o device_query; ./device_query

#include <stdio.h> 
#include "../cuda-by-example/common/book.h" // download locally and reference accordingly

int main() {
    int *arr;
    HANDLE_ERROR( cudaMalloc((void **) &arr, sizeof(int)*1024*1024*1024*10) );
    cudaFree( arr );
}

and runs as

root@82c2bdcd5ad8:/home/cuZK# ./device_query
out of memory in device_query.cu at line 8
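
If you would rather not depend on the book's header, a similar check is easy to write yourself. The following is only a minimal sketch: the names report_cuda_error and CUDA_CHECK are my own and do not come from book.h or the CUDA toolkit. It prints the error string together with the file and line, then exits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from book.h): prints the error and its location,
// and returns false if `err` is anything other than cudaSuccess.
static bool report_cuda_error(cudaError_t err, const char *file, int line) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s (%s) in %s at line %d\n",
            cudaGetErrorString(err), cudaGetErrorName(err), file, line);
    return false;
}

// Hypothetical macro: wrap any CUDA runtime call and exit on failure.
#define CUDA_CHECK(call)                                        \
    do {                                                        \
        if (!report_cuda_error((call), __FILE__, __LINE__))     \
            exit(EXIT_FAILURE);                                 \
    } while (0)

int main() {
    int *arr = NULL;
    CUDA_CHECK(cudaMalloc((void **)&arr, sizeof(int) * 100));
    CUDA_CHECK(cudaFree(arr));
    return 0;
}
```

Wrapping every runtime call this way means the first failing call is reported where it happens, which helps in a program that only fails after hours of running.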

Upvotes: 0

Programmer

Reputation: 6753

You can do the following:

cudaError_t err= cudaMalloc((void **) &arr, sizeof(int)*100);
if(err != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

This prints the reason for the error as a readable string. For example, "invalid device pointer" means that a pointer passed to a CUDA call is not a valid device pointer.
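
Since you mention that an external monitor still shows free memory, it may also help to ask the CUDA runtime itself how much memory it considers free at the moment the allocation fails. A rough sketch, assuming a helper I am calling malloc_with_report (a made-up name, not part of CUDA) and using cudaMemGetInfo:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: tries to allocate `bytes`; on failure prints the error
// string and the free/total device memory as reported by the CUDA runtime.
static cudaError_t malloc_with_report(void **ptr, size_t bytes) {
    cudaError_t err = cudaMalloc(ptr, bytes);
    if (err != cudaSuccess) {
        size_t free_b = 0, total_b = 0;
        fprintf(stderr, "cudaMalloc(%zu bytes) failed: %s\n",
                bytes, cudaGetErrorString(err));
        // cudaMemGetInfo can fail too, so only print if it succeeded.
        if (cudaMemGetInfo(&free_b, &total_b) == cudaSuccess)
            fprintf(stderr, "free: %zu bytes, total: %zu bytes\n",
                    free_b, total_b);
    }
    return err;
}

int main() {
    int *arr = NULL;
    if (malloc_with_report((void **)&arr, sizeof(int) * 100) == cudaSuccess)
        cudaFree(arr);
    return 0;
}
```

The error string matters here: cudaMalloc may also return an error code left over from an earlier asynchronous kernel launch, so the message tells you whether this is really an out-of-memory condition or a failure from earlier work.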

Upvotes: 2
