Reputation: 5851
I have a CUDA program that uses thrust::reduce to parallelize sums: for example,
thrust::device_ptr<double> tmp(aux);
double my_sum = thrust::reduce(tmp, tmp + G);
where double* aux points to G contiguous doubles on the device. I need to compare the runtime of the whole parallelized program to a version with no parallel computation. Is there a way to run thrust::reduce using only a single thread on the device? A global switch would be the most convenient option.
Upvotes: 2
Views: 147
Reputation: 72372
You should be able to do this by invoking thrust::reduce within a kernel using the serial execution policy (thrust::seq) and then launching that kernel with a single thread. Something like:
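#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// thrust::seq runs the reduction sequentially in the calling thread.
__global__ void serial_reduce(double *result, double *aux, int G)
{
    *result = thrust::reduce(thrust::seq, aux, aux + G);
}

// Host side: launch one block containing one thread.
double *result;
cudaMallocManaged(&result, sizeof(double));
serial_reduce<<<1,1>>>(result, aux, G);
cudaDeviceSynchronize(); // wait for the kernel before reading *result on the host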
[Note: written in browser and totally untested, use at own risk]
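For the runtime comparison the question asks about, here is a minimal sketch of a complete program that times the single-thread kernel with CUDA events and checks it against the ordinary parallel thrust::reduce. The helper name serial_device_sum, the test size G = 1 << 20 and the all-ones input are assumptions made for illustration, not part of the original code, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/device_ptr.h>

// thrust::seq runs the reduction sequentially in the calling thread,
// which here is the only thread of a <<<1,1>>> launch.
__global__ void serial_reduce(double *result, const double *aux, int G)
{
    *result = thrust::reduce(thrust::seq, aux, aux + G);
}

// Hypothetical helper (not from the original post): sums G device doubles
// with one device thread and stores the kernel time in milliseconds in *ms.
double serial_device_sum(const double *aux, int G, float *ms)
{
    double *result = nullptr;
    cudaMallocManaged(&result, sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    serial_reduce<<<1, 1>>>(result, aux, G);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the stop event has been reached
    cudaDeviceSynchronize();      // make sure the device is idle before the host reads *result
    cudaEventElapsedTime(ms, start, stop);

    double sum = *result;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(result);
    return sum;
}

int main()
{
    const int G = 1 << 20;                 // assumed test size
    std::vector<double> host(G, 1.0);      // assumed test data: all ones

    double *aux = nullptr;
    cudaMalloc(&aux, G * sizeof(double));
    cudaMemcpy(aux, host.data(), G * sizeof(double), cudaMemcpyHostToDevice);

    float ms = 0.0f;
    double serial_sum = serial_device_sum(aux, G, &ms);

    // Ordinary parallel reduction, as in the question, for comparison.
    thrust::device_ptr<double> tmp(aux);
    double parallel_sum = thrust::reduce(tmp, tmp + G);

    std::printf("serial sum   = %f (kernel time %f ms)\n", serial_sum, ms);
    std::printf("parallel sum = %f\n", parallel_sum);

    cudaFree(aux);
    return 0;
}
```

The first kernel launch in a process typically carries one-time startup overhead, so for a fair comparison it is probably worth running a warm-up launch and averaging several timed runs. Note also that this only serializes the one reduction wrapped in the kernel; it is not the global switch the question hoped for.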
Upvotes: 6