CoinCheung

Reputation: 115

CUDA thrust::sort hits a memory problem even though I still have enough memory

I am working with CUDA 10.2 on Ubuntu 18.04. My GPU is a Tesla T4, which has 16 GB of memory, and no other programs are running on it. A short piece of code that reproduces the problem is the following:

#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <numeric>
#include <chrono>

#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>


struct sort_functor {

    thrust::device_ptr<float> data;
    int stride = 1;
    __host__ __device__
    void operator()(int idx) {
        thrust::sort(thrust::device,
                data + idx * stride, 
                data + (idx + 1) * stride);
    }
};


int main() {
    std::random_device rd;
    std::mt19937 engine;
    engine.seed(rd());
    std::uniform_real_distribution<float> u(0, 90.);

    int M = 8;
    int N = 8 * 384 * 300;

    std::vector<float> v(M * N);
    std::generate(v.begin(), v.end(), [&](){return u(engine);});
    thrust::host_vector<float> hv(v.begin(), v.end());
    thrust::device_vector<float> dv = hv;

    thrust::device_vector<float> res(dv.begin(), dv.end());

    thrust::device_vector<int> index(M);
    thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);

    thrust::for_each(thrust::device, index.begin(), index.end(), 
            sort_functor{res.data(), N}
            );
    cudaDeviceSynchronize();

    return 0;
}

The error message is:

temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Aborted (core dumped)

How can I solve this problem, please?

Upvotes: 2

Views: 1103

Answers (2)

paleonix

Reputation: 3031

While Robert Crovella's answer technically solves the problem, nesting Thrust algorithms, and therefore relying on CUDA Dynamic Parallelism (CDP), is inefficient in the first place when the problem size is known in advance. You can find my argument on this matter, and how Thrust 1.15 deprecated CDP, detailed in my answer to Why the iterating range of thrust::reduce is limited to 2048 double?.

For a batched sort there is cub::DeviceSegmentedSort or, as you are sorting floats, cub::DeviceSegmentedRadixSort. CUB is used in the backend of Thrust and is therefore always available whenever Thrust (with the CUDA backend) is available. cub::DeviceSegmentedSort came with CUB 1.15 in October 2021, i.e. a year too late for OP, while cub::DeviceSegmentedRadixSort had already been available.
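As a minimal sketch of what this could look like for the layout in the question (M and N are taken from the question; the variable names, the use of thrust::device_vector for buffers, and the way the input is filled are illustrative, not part of OP's code):

#include <cuda_runtime.h>
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main() {
    const int M = 8;              // number of segments (as in the question)
    const int N = 8 * 384 * 300;  // elements per segment (as in the question)

    thrust::device_vector<float> keys_in(M * N);   // fill with the data to sort
    thrust::device_vector<float> keys_out(M * N);

    // Segment i covers [offsets[i], offsets[i + 1]).
    thrust::device_vector<int> offsets(M + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, N);

    const float* d_in  = thrust::raw_pointer_cast(keys_in.data());
    float*       d_out = thrust::raw_pointer_cast(keys_out.data());
    const int*   d_off = thrust::raw_pointer_cast(offsets.data());

    // First call with a null temporary buffer only queries the required size.
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
            d_in, d_out, M * N, M, d_off, d_off + 1);

    thrust::device_vector<char> temp_storage(temp_storage_bytes);
    d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

    // Second call performs the batched sort of all M segments in one launch.
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
            d_in, d_out, M * N, M, d_off, d_off + 1);

    cudaDeviceSynchronize();
    return 0;
}

Because CUB takes its temporary storage explicitly from the caller, nothing is allocated from the device heap, so the 8 MB malloc-heap limit discussed in the other answer does not come into play.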

Upvotes: 0

Robert Crovella

Reputation: 151799

thrust::sort requires O(N) temporary memory allocation. When you call it from device code (in your functor), that temporary allocation (for each call, i.e. for each of your 8 calls) is done on the device, using new or malloc under the hood, and it comes out of the "device heap" space. The device heap is limited to 8 MB by default, but you can change this. You are hitting this limit.

If you add the following at the top of your main routine:

cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1048576ULL*1024);

With that change, your code runs without any runtime errors for me.

I'm not suggesting that I calculated the 1 GB value above carefully. I simply picked a value much larger than 8 MB but much smaller than 16 GB, and it seemed to work. In the general case, you should carefully estimate the amount of temporary allocation you will need.
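For a rough sense of why the 8 MB default is not enough here (assuming the temporary allocation is on the order of one element per input element): each segment holds 8 * 384 * 300 = 921,600 floats, about 3.7 MB, and the 8 segment sorts can run concurrently, so tens of megabytes of device heap may be in use at once. Below is a minimal sketch, using the same ad-hoc 1 GB value, that sets the limit with error checking and reads it back with cudaDeviceGetLimit to confirm the setting; the Thrust code from the question would go where the comment indicates.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Raise the device malloc heap from its 8 MB default to 1 GB.
    // This must happen before launching any kernel that allocates from the heap.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1048576ULL * 1024);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to confirm what the runtime actually granted.
    size_t heap_bytes = 0;
    cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
    std::printf("device heap size: %zu bytes\n", heap_bytes);

    // ... the Thrust code from the question would follow here ...
    return 0;
}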

Upvotes: 3
