Reputation: 115
I am working with CUDA 10.2 on Ubuntu 18.04. My GPU is a Tesla T4, which has 16 GB of memory, and no other programs are running on it. The short piece of code is as follows:
#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <numeric>
#include <chrono>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
struct sort_functor {
    thrust::device_ptr<float> data;
    int stride = 1;

    __host__ __device__
    void operator()(int idx) {
        thrust::sort(thrust::device,
                     data + idx * stride,
                     data + (idx + 1) * stride);
    }
};
int main() {
    std::random_device rd;
    std::mt19937 engine;
    engine.seed(rd());
    std::uniform_real_distribution<float> u(0.f, 90.f);

    int M = 8;
    int N = 8 * 384 * 300;
    std::vector<float> v(M * N);
    std::generate(v.begin(), v.end(), [&]() { return u(engine); });

    thrust::host_vector<float> hv(v.begin(), v.end());
    thrust::device_vector<float> dv = hv;
    thrust::device_vector<float> res(dv.begin(), dv.end());

    thrust::device_vector<int> index(M);
    thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);
    thrust::for_each(thrust::device, index.begin(), index.end(),
                     sort_functor{res.data(), N});

    cudaDeviceSynchronize();
    return 0;
}
The error message is:
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
terminate called after throwing an instance of 'thrust::system::system_error'
what(): for_each: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Aborted (core dumped)
How can I solve this problem?
Upvotes: 2
Views: 1103
Reputation: 3031
While Robert Crovella's answer technically solves the problem, nesting Thrust algorithms, and thereby relying on CUDA Dynamic Parallelism (CDP), is inefficient in the first place when the problem size is known. You can find my argument on this matter, and details on how Thrust 1.15 deprecated CDP, in my answer to Why the iterating range of thrust::reduce is limited to 2048 double?.
For doing a batched sort there is cub::DeviceSegmentedSort or, as you are sorting floats, cub::DeviceSegmentedRadixSort. CUB is used in the backend of Thrust and is therefore always available whenever Thrust (with the CUDA backend) is available. These algorithms came with CUB 1.15 in October 2021, i.e. a year too late for OP.
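As an illustration, here is a minimal sketch of the batched sort via cub::DeviceSegmentedRadixSort::SortKeys, assuming OP's layout of M segments of N floats each. The two-phase call pattern (a size query followed by the actual sort) is standard CUB; the variable names and the doubled key buffer are my choices, not code from the question:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main() {
    const int M = 8;              // number of segments, as in the question
    const int N = 8 * 384 * 300;  // elements per segment
    thrust::device_vector<float> keys_in(M * N);   // fill as in the question
    thrust::device_vector<float> keys_out(M * N);

    // Segment i spans [offsets[i], offsets[i + 1]).
    thrust::device_vector<int> offsets(M + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, N);

    const float* d_in  = thrust::raw_pointer_cast(keys_in.data());
    float*       d_out = thrust::raw_pointer_cast(keys_out.data());
    const int*   d_off = thrust::raw_pointer_cast(offsets.data());

    // First call: only queries the required temporary storage size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            M * N, M, d_off, d_off + 1);

    // Second call: allocate the scratch space and run the batched sort.
    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            M * N, M, d_off, d_off + 1);

    cudaDeviceSynchronize();
    return 0;
}

Note that, unlike the device heap workaround, the temporary storage here is a single explicit allocation from ordinary device memory, sized exactly as CUB requests.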
Upvotes: 0
Reputation: 151799
thrust::sort requires O(N) temporary memory allocation. When you call it from device code (in your functor), that temporary memory allocation (for each call, i.e. from each of your 8 calls) will be done on the device, using new or malloc under the hood, and the allocation will come out of the "device heap" space. The device heap space is limited to 8 MB by default, but you can change this. You are hitting this limit.
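You can confirm the default by querying the limit; a small sketch using cudaDeviceGetLimit (the printout is mine, not from OP's program):

size_t heap_bytes = 0;
cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
// prints 8388608 (= 8 MB) on a device still at the default
std::cout << "device malloc heap: " << heap_bytes << " bytes\n";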
If you add the following at the top of your main routine:
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1048576ULL*1024);
Your code runs without any runtime errors for me.
I'm not suggesting that I calculated the 1 GB value above carefully. I simply picked a value much larger than 8 MB but much smaller than 16 GB, and it seemed to work. In the general case, you should carefully estimate the amount of temporary allocation space you will need.
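For example, a rough upper bound for this code (my own estimate, since the exact temporary footprint of thrust::sort is an implementation detail) is one O(N) scratch buffer per concurrent sort, with a safety factor:

// M device-side sorts may run concurrently, each needing on the order of
// its input size (N floats) in temporary storage; double it as a margin.
size_t per_sort   = size_t(N) * sizeof(float);   // ~3.7 MB per segment here
size_t heap_bytes = 2 * size_t(M) * per_sort;    // ~59 MB for M = 8
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);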
Upvotes: 3