Reputation: 765
Copying an array of shape (20, 256, 256) from the GPU back to the CPU is really slow (on the order of hundreds of seconds).
My code is the following:
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)
# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)
I measured the times using cp.cuda.Event() with record and synchronize to avoid measuring any random times, but the result is still the same: the GPU->CPU transfer is incredibly slow. However, with PyTorch or TensorFlow this is not the case (in my experience, for similar data sizes and shapes). What am I doing wrong?
Upvotes: 0
Views: 1583
Reputation: 1
I ran into the same problem. I found that accessing float64 data is much faster than float32, so maybe you can try .astype(np.float64).
Upvotes: 0
Reputation: 298
I think you might be timing it wrong. I modified the code to synchronize after every GPU operation, and it seems the convolution takes most of the time, while both transfer operations are very fast.
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time
# Create the input on the CPU, then time the host->device transfer (fast)
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Time the convolution (this is the slow step once it is synchronized)
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Time the device->host transfer (fast once the convolution has finished)
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
Output (in seconds):
0.1380000114440918
4.032999753952026
0.0010001659393310547
It seems you were not actually synchronizing between calls when you tested it. Without the synchronize calls, all operations up to the transfer back to a NumPy array are simply queued and appear to finish instantly. The measured GPU->CPU transfer time would then actually be the time for the convolution plus the transfer.
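To make the pattern reusable, here is a minimal sketch of a timing helper that synchronizes before and after each step, so queued work from an earlier step is not billed to a later one. The names timed, sync, to_host and HAVE_GPU are made up for this illustration and are not part of CuPy. xp.sqrt is only a cheap stand-in for the convolution, and the NumPy fallback is there just so the sketch runs on a machine without a GPU.

```python
import time

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False  # assumption: fall back to NumPy so the sketch still runs

xp = cp if HAVE_GPU else np


def sync():
    """Block until all work queued on the current stream has finished."""
    if HAVE_GPU:
        cp.cuda.get_current_stream().synchronize()
    # NumPy executes eagerly, so there is nothing to wait for on the CPU.


def to_host(a):
    """Copy an array back to host memory as a NumPy array."""
    return cp.asnumpy(a) if HAVE_GPU else np.asarray(a)


def timed(fn, *args, **kwargs):
    """Return (result, seconds), synchronizing before and after the call."""
    sync()  # drain previously queued work so it is not counted here
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    sync()  # wait for this step's own work to finish
    return result, time.perf_counter() - t0


x_cpu = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)

x_dev, t_upload = timed(xp.asarray, x_cpu)   # host -> device
y_dev, t_compute = timed(xp.sqrt, x_dev)     # stand-in for the convolution
y_cpu, t_download = timed(to_host, y_dev)    # device -> host

print(f"upload:   {t_upload:.6f} s")
print(f"compute:  {t_compute:.6f} s")
print(f"download: {t_download:.6f} s")
```

With your code you would pass the convolution in place of xp.sqrt, for example timed(convolve, xt_gpu, xt_gpu, mode='constant'). The first GPU call in a process may also include one-time setup such as kernel compilation, so repeating the measurement is probably worthwhile.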
Upvotes: 1