Python -- cuda latency?

Question

With various libraries that support GPU programming, I am finding that I get worse performance in my algorithm on GPU vs CPU. I believe this is due to latency communicating between the two devices.

My platform is W10x64 with an i7-7700HQ and GTX 1050 in a Dell XPS 15 laptop.

If I use any library, e.g. pytorch.cuda.FloatTensor, or a cupy.ndarray touching a GPU array seems to require about 20~40us. Here's a MWE:

import cupy as cu

ary = cu.empty((1))
const_one = cu.ones((1))

%timeit ary + const_one
> 18.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Operating on 1 element is not what GPUs are for, this is a contrived example to show minimum operation time with two pieces of data, both residing on the GPU.

I believe the structure of cuda code is that a queue of operations is built and consumed as the GPU is capable, so this latency washes away over time or with larger blocks of memory?

Here is a complete comparison between the same algorithm in numpy and cupy, which completes a phase error over a 128x128 optical pupil with double precision, and uses it to create a point spread function.

I have tried to be as careful as possible to mitigate host-device transfers; only the ints for array sizes exist on CPU for cupy, as I could not get them on GPU ahead of time.

Initial setup:

precision = 'float32'
ary_size = 128
pad = ary_size // 2
cu0 = cu.zeros((1))
cu2 = cu.ones((1)) * 2
cu1 = cu.ones((1))

CUDA execution

%%timeit
x = cu.linspace(-cu1, cu1, ary_size, dtype=precision)
y = cu.linspace(-cu1, cu1, ary_size, dtype=precision)
xx, yy = cu.meshgrid(x, y)
rho, phi = cu.sqrt(xx**cu2 + yy**cu2), cu.arctan2(yy, xx)
phase_err = rho ** cu2 * cu.cos(phi)
mask = rho > cu1
wv_ary = cu.exp(1j * cu2 * np.pi * phase_err)
wv_ary[mask] = cu0
padded = cu.pad(wv_ary, ((pad, pad), (pad, pad)), mode='constant', constant_values=0)
psf = fftshift(fft2(ifftshift(padded)))
intensity_psf = abs(psf)**cu2
> 4.73 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy equivalent:

%%timeit
x = np.linspace(-1, 1, ary_size, dtype=precision)
y = np.linspace(-1, 1, ary_size, dtype=precision)
xx, yy = np.meshgrid(x, y)
rho, phi = np.sqrt(xx**2 + yy**2), np.arctan2(yy, xx)
phase_err = rho ** 2 * np.cos(phi)
mask = rho > 1
wv_ary = np.exp(1j * 2 * np.pi * phase_err)
wv_ary[mask] = 0
padded = np.pad(wv_ary, ((pad, pad), (pad, pad)), mode='constant', constant_values=0)
psf = nfftshift(nfft2(nifftshift(padded)))
intensity_psf = abs(psf)**2
> 7.29 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So I only get a 35% performance improvement with cuda. I know I don't have a particularly beefy GPU and its fp64 is much worse than fp32 performance; but repeating with f32 precision results in no measurable increase in speed.

I also know that if I change size to a much larger value, e.g. 512, CUDA showcases GPU performance better, with times of 8.19ms for GPU and 144ms for CPU, resp.

So it seems this GPU-CPU coordination latency is what kills me at small array sizes. Is this a quirk of my laptop? It is surprisingly difficult to find information on CPU-GPU latency, but there are some reports I have seen that PCI-E latency is less than 1us. If this were the case, then my cuda code would go around 20x faster and be much more usable.

Python -- cuda latency?

Answers (1)

Related Questions