anki

Reputation: 765

Only the GPU to CPU transfer with CuPy is incredibly slow

If I have an array of shape (20, 256, 256) on the GPU, copying it back to the CPU is really slow (on the order of hundreds of seconds).

My code is the following:

import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np

# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)

# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')

# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)

I measured the times using cp.cuda.Event() with record and synchronize to rule out timing artifacts, but the result is the same: the GPU->CPU transfer is incredibly slow. With PyTorch or TensorFlow this is not the case (in my experience, for similar data sizes and shapes). What am I doing wrong?

Upvotes: 0

Views: 1583

Answers (2)

lt z

Reputation: 1

I ran into the same problem. I found that accessing float64 data is much faster than float32, so you could try converting with .astype(np.float64).

Upvotes: 0

S. Strempfer

Reputation: 298

I think you might be timing it wrong. I modified the code to synchronize after every GPU operation, and the convolution takes the majority of the time, while both transfer operations are very fast.

import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time

# Create the input on the host (not timed)
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)

# Host -> GPU transfer
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)

# Convolution on the GPU (this is the slow step)
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)

# GPU -> host transfer
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)

Output:

0.1380000114440918
4.032999753952026
0.0010001659393310547

It seems you were not actually synchronizing between calls when you tested it. Without the synchronize calls, every operation up to the transfer back to a NumPy array is simply queued and appears to finish instantly. The measured GPU->CPU transfer time is then really the time for the convolution plus the transfer.
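The same measurement can be written with a small helper that synchronizes before and after each step, so that no step is charged for work queued by an earlier one. This is only a sketch: `timed` and `time_question_pipeline` are names made up for illustration and are not part of CuPy, and the CuPy imports sit inside the second function on the assumption that the helper may also be loaded on a machine without a GPU.

```python
import time


def timed(fn, sync):
    """Run fn() and return (result, elapsed_seconds).

    `sync` is a zero-argument callable that blocks until all queued
    device work has finished (hypothetical helper, not a CuPy API).
    """
    sync()                       # drain work queued by earlier steps
    t0 = time.perf_counter()
    result = fn()
    sync()                       # wait for this step's own work to finish
    return result, time.perf_counter() - t0


def time_question_pipeline():
    """Time the three steps from the question; needs CuPy and a CUDA device."""
    import numpy as np
    import cupy as cp
    from cupyx.scipy.ndimage import convolve

    sync = cp.cuda.get_current_stream().synchronize
    xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)

    xt_gpu, t_h2d = timed(lambda: cp.asarray(xt), sync)
    result_gpu, t_conv = timed(
        lambda: convolve(xt_gpu, xt_gpu, mode='constant'), sync)
    result_cpu, t_d2h = timed(lambda: cp.asnumpy(result_gpu), sync)

    return {"host_to_device": t_h2d,
            "convolve": t_conv,
            "device_to_host": t_d2h}
```

Calling `time_question_pipeline()` on a machine with a GPU returns one duration per step. If the convolution entry dominates, as in the output above, the transfer itself is not the bottleneck.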

Upvotes: 1
