Reputation: 139
I'm not sure if this is possible or not in an asynchronous fashion, but what I'd like to do is the following. Suppose I have the following array on device:
d_arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Also suppose I have a host array h_arr of size 3. Lastly, suppose I have the following array of pointers:
p_arr = [&d_arr[0], &d_arr[4], &d_arr[8]]
I would like to call an imaginary function
cudaMemcpyAsyncDisjoint(&d_arr[0], &h_arr[0], &p_arr[0], 3)
which then fills the array h_arr so that it becomes
h_arr = [0, 4, 8]
I want to do this asynchronously because my main concern is speed, since I have a driver method which runs kernels in a loop and then copies back data at the end of each round.
Upvotes: 0
Views: 162
Reputation: 152164
If the strides between elements of p_arr are constant, then this is possible in a single operation with cudaMemcpy2DAsync.
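As a rough sketch of the constant-stride case, the strided elements can be treated as a 2D region: each "row" is one element wide, and the source pitch is the stride in bytes. The helper names `copy_strided_async` and `strided_copy_demo` are made up for illustration, the stride of 4 elements comes from the question's example, and error checking is omitted for brevity. Pinned host memory (cudaMallocHost) is assumed here, since a copy into pageable memory generally will not overlap with other work.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: enqueue a copy of `count` ints, taking one element
// every `stride` elements of d_src, into the packed host buffer h_dst.
cudaError_t copy_strided_async(int* h_dst, const int* d_src,
                               size_t stride, size_t count,
                               cudaStream_t stream)
{
    return cudaMemcpy2DAsync(h_dst, sizeof(int),          // dst, dst pitch (packed)
                             d_src, stride * sizeof(int), // src, src pitch (the stride)
                             sizeof(int),                 // width of each row in bytes
                             count,                       // number of rows
                             cudaMemcpyDeviceToHost, stream);
}

// Hypothetical demo using the arrays from the question.
std::vector<int> strided_copy_demo()
{
    const int h_init[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    int* d_arr = nullptr;
    cudaMalloc(&d_arr, sizeof(h_init));
    cudaMemcpy(d_arr, h_init, sizeof(h_init), cudaMemcpyHostToDevice);

    // Pinned host buffer so the copy can actually run asynchronously.
    int* h_arr = nullptr;
    cudaMallocHost(&h_arr, 3 * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Elements 0, 4 and 8: stride of 4 elements, 3 elements in total.
    copy_strided_async(h_arr, d_arr, 4, 3, stream);

    // Wait for the copy before reading h_arr on the host.
    cudaStreamSynchronize(stream);

    std::vector<int> result(h_arr, h_arr + 3);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_arr);
    cudaFree(d_arr);
    return result;
}
```

With the question's d_arr, calling strided_copy_demo() should return the elements at indices 0, 4 and 8. In a real loop the synchronize would be deferred until the host actually needs the data.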
For varying strides it's not possible in a single operation. Furthermore, the single-operation method (with constant stride) is not necessarily the fastest way: cudaMemcpy2DAsync does not necessarily get close to expected bus transfer speeds. For the fastest method, plus the ability to handle varying strides between the elements to be copied, the usual recommendation is to break this into two steps:
1. thrust::gather (or thrust::copy with a permutation iterator) to collect all elements to be copied into a contiguous, temporary device buffer
2. cudaMemcpyAsync to copy that buffer to host.
Upvotes: 2
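The two-step approach from the answer might look roughly like the sketch below. It assumes the elements are identified by indices into d_arr rather than by the raw pointers in the question's p_arr, since thrust::gather takes an index map. The function name `gather_copy_demo` is made up for illustration, and error checking is omitted. Whether the gather itself returns before completion may depend on the Thrust version in use, so treat the stream usage as an assumption to verify.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <vector>

// Hypothetical demo: gather d_arr[0], d_arr[4], d_arr[8] into a contiguous
// device buffer, then copy that buffer to pinned host memory.
std::vector<int> gather_copy_demo()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // d_arr = [0, 1, ..., 9], as in the question.
    thrust::device_vector<int> d_arr(10);
    thrust::sequence(d_arr.begin(), d_arr.end());

    // Index map (assumed in place of the pointer array p_arr).
    const int h_idx[3] = {0, 4, 8};
    thrust::device_vector<int> d_idx(h_idx, h_idx + 3);

    // Contiguous temporary device buffer for the gathered elements.
    thrust::device_vector<int> d_tmp(3);

    // Pinned host buffer so the device-to-host copy can be asynchronous.
    int* h_arr = nullptr;
    cudaMallocHost(&h_arr, 3 * sizeof(int));

    // Step 1: d_tmp[i] = d_arr[d_idx[i]], issued on `stream`.
    thrust::gather(thrust::cuda::par.on(stream),
                   d_idx.begin(), d_idx.end(),
                   d_arr.begin(), d_tmp.begin());

    // Step 2: one contiguous copy of the packed buffer, on the same stream.
    cudaMemcpyAsync(h_arr, thrust::raw_pointer_cast(d_tmp.data()),
                    3 * sizeof(int), cudaMemcpyDeviceToHost, stream);

    // Wait for the copy before reading h_arr on the host.
    cudaStreamSynchronize(stream);

    std::vector<int> result(h_arr, h_arr + 3);

    cudaFreeHost(h_arr);
    cudaStreamDestroy(stream);
    return result;
}
```

Because the index map is arbitrary, the same code handles varying strides; only the contents of h_idx change. In a kernel loop, d_idx and d_tmp would normally be allocated once and reused across rounds.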