Reputation: 115
I am working with CUDA 10.2 on Ubuntu 18.04. My GPU is a Tesla T4, which has 16 GB of memory, and no other programs are running on it. The short piece of code is as follows:
#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <numeric>
#include <chrono>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
struct sort_functor {
    thrust::device_ptr<float> data;
    int stride = 1;

    __host__ __device__
    void operator()(int idx) {
        thrust::sort(thrust::device,
                     data + idx * stride,
                     data + (idx + 1) * stride);
    }
};
int main() {
    std::random_device rd;
    std::mt19937 engine;
    engine.seed(rd());
    std::uniform_real_distribution<float> u(0.f, 90.f);

    int M = 8;
    int N = 8 * 384 * 300;
    std::vector<float> v(M * N);
    std::generate(v.begin(), v.end(), [&]() { return u(engine); });

    thrust::host_vector<float> hv(v.begin(), v.end());
    thrust::device_vector<float> dv = hv;
    thrust::device_vector<float> res(dv.begin(), dv.end());

    thrust::device_vector<int> index(M);
    thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);
    thrust::for_each(thrust::device, index.begin(), index.end(),
                     sort_functor{res.data(), N});

    cudaDeviceSynchronize();
    return 0;
}
The error message is:
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
terminate called after throwing an instance of 'thrust::system::system_error'
what(): for_each: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Aborted (core dumped)
How can I solve this problem?
Upvotes: 2
Views: 1103
Reputation: 3031
While Robert Crovella's answer technically solves the problem, nesting Thrust algorithms, and thereby relying on CUDA Dynamic Parallelism (CDP), is inefficient in the first place when the problem size is known. You can find my argument on this matter, and details on how Thrust 1.15 deprecated CDP, in my answer to Why the iterating range of thrust::reduce is limited to 2048 double?.
For doing a batched sort there is cub::DeviceSegmentedSort or, as you are sorting floats, cub::DeviceSegmentedRadixSort. CUB is used in the backend of Thrust and is therefore always available whenever Thrust (with the CUDA backend) is available. These algorithms came with CUB 1.15 in October 2021, i.e. a year too late for OP.
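As an illustration, here is a minimal sketch of the batched sort via cub::DeviceSegmentedRadixSort::SortKeys, assuming OP's layout of M segments of N floats each. The two-phase call pattern (a size query followed by the actual sort) is standard CUB; the variable names and the doubled key buffer are my choices, not code from the question:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main() {
    const int M = 8;              // number of segments, as in the question
    const int N = 8 * 384 * 300;  // elements per segment
    thrust::device_vector<float> keys_in(M * N);   // fill as in the question
    thrust::device_vector<float> keys_out(M * N);

    // Segment i spans [offsets[i], offsets[i + 1]).
    thrust::device_vector<int> offsets(M + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, N);

    const float* d_in  = thrust::raw_pointer_cast(keys_in.data());
    float*       d_out = thrust::raw_pointer_cast(keys_out.data());
    const int*   d_off = thrust::raw_pointer_cast(offsets.data());

    // First call: only queries the required temporary storage size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            M * N, M, d_off, d_off + 1);

    // Second call: allocate the scratch space and run the batched sort.
    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            M * N, M, d_off, d_off + 1);

    cudaDeviceSynchronize();
    return 0;
}

Note that, unlike the device heap workaround, the temporary storage here is a single explicit allocation from ordinary device memory, sized exactly as CUB requests.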
Upvotes: 0
Reputation: 151799
thrust::sort requires O(N) temporary memory allocation. When you call it from device code (in your functor), that temporary memory allocation (for each call, i.e. from each of your 8 calls) will be done on the device, using new or malloc under the hood, and the allocation will come out of the "device heap" space. The device heap space is limited to 8 MB by default, but you can change this. You are hitting this limit.
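You can confirm the default by querying the limit; a small sketch using cudaDeviceGetLimit (the printout is mine, not from OP's program):

size_t heap_bytes = 0;
cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
// prints 8388608 (= 8 MB) on a device still at the default
std::cout << "device malloc heap: " << heap_bytes << " bytes\n";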
If you add the following at the top of your main routine:
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1048576ULL*1024);
Your code runs without any runtime errors for me.
I'm not suggesting that I calculated the 1 GB value above carefully. I simply picked a value much larger than 8 MB but much smaller than 16 GB, and it seemed to work. In the general case, you should carefully estimate the amount of temporary allocation space you will need.
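For example, a rough upper bound for this code (my own estimate, since the exact temporary footprint of thrust::sort is an implementation detail) is one O(N) scratch buffer per concurrent sort, with a safety factor:

// M device-side sorts may run concurrently, each needing on the order of
// its input size (N floats) in temporary storage; double it as a margin.
size_t per_sort   = size_t(N) * sizeof(float);   // ~3.7 MB per segment here
size_t heap_bytes = 2 * size_t(M) * per_sort;    // ~59 MB for M = 8
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);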
Upvotes: 3