Brandon Dube

Reputation: 448

Python -- CUDA latency?

Across several libraries that support GPU programming, my algorithm runs slower on the GPU than on the CPU. I believe this is due to latency in communication between the two devices.

My platform is W10x64 with an i7-7700HQ and GTX 1050 in a Dell XPS 15 laptop.

Whichever library I use, e.g. torch.cuda.FloatTensor or cupy.ndarray, touching a GPU array seems to require about 20–40 µs. Here's an MWE:

import cupy as cu

ary = cu.empty((1,))
const_one = cu.ones((1,))

%timeit ary + const_one
> 18.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Operating on one element is not what GPUs are for; this is a contrived example showing the minimum time for an operation on two pieces of data, both already resident on the GPU.

My understanding of CUDA is that a queue of operations is built up and consumed as fast as the GPU is able, so does this latency wash away over time, or with larger blocks of memory?
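The amortization question can be put into a toy model: a fixed per-operation launch cost plus a per-element throughput term. The ~20 µs overhead is the figure measured above; the per-element cost is an illustrative assumption, not a measurement of any particular device:

```python
# Toy model of launch-overhead amortization: fixed per-operation
# cost plus per-element throughput term.
OVERHEAD_US = 20.0      # ~launch latency measured in the MWE above
PER_ELEMENT_US = 1e-4   # illustrative per-element cost (assumed)

def modeled_time_us(n_elements):
    """Modeled wall time for one elementwise GPU operation, in us."""
    return OVERHEAD_US + PER_ELEMENT_US * n_elements

for n in (1, 128 * 128, 512 * 512):
    t = modeled_time_us(n)
    # per-element cost shrinks as n grows: the overhead amortizes
    print(f"n={n:>7}  total={t:8.1f} us  per-element={t / n:.2e} us")
```

Under this model the fixed cost dominates completely at one element and becomes a small fraction of the total at 512x512, which matches the behavior reported below for larger arrays.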

Here is a complete comparison of the same algorithm in numpy and cupy, which computes a phase error over a 128x128 optical pupil in double precision and uses it to create a point spread function.

I have tried to be as careful as possible to avoid host-device transfers; in the cupy version, only the ints for the array sizes live on the CPU, as I could not move them to the GPU ahead of time.

Initial setup:

import cupy as cu
import numpy as np
from cupy.fft import fft2, fftshift, ifftshift

precision = 'float32'
ary_size = 128
pad = ary_size // 2
cu0 = cu.zeros((1,))
cu1 = cu.ones((1,))
cu2 = cu.ones((1,)) * 2

CUDA execution:

%%timeit
x = cu.linspace(-cu1, cu1, ary_size, dtype=precision)
y = cu.linspace(-cu1, cu1, ary_size, dtype=precision)
xx, yy = cu.meshgrid(x, y)
rho, phi = cu.sqrt(xx**cu2 + yy**cu2), cu.arctan2(yy, xx)
phase_err = rho ** cu2 * cu.cos(phi)
mask = rho > cu1
wv_ary = cu.exp(1j * cu2 * np.pi * phase_err)
wv_ary[mask] = cu0
padded = cu.pad(wv_ary, ((pad, pad), (pad, pad)), mode='constant', constant_values=0)
psf = fftshift(fft2(ifftshift(padded)))
intensity_psf = abs(psf)**cu2
> 4.73 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy equivalent (nfft2, nfftshift, and nifftshift are numpy.fft's fft2, fftshift, and ifftshift, aliased to avoid clashing with the cupy names):

%%timeit
x = np.linspace(-1, 1, ary_size, dtype=precision)
y = np.linspace(-1, 1, ary_size, dtype=precision)
xx, yy = np.meshgrid(x, y)
rho, phi = np.sqrt(xx**2 + yy**2), np.arctan2(yy, xx)
phase_err = rho ** 2 * np.cos(phi)
mask = rho > 1
wv_ary = np.exp(1j * 2 * np.pi * phase_err)
wv_ary[mask] = 0
padded = np.pad(wv_ary, ((pad, pad), (pad, pad)), mode='constant', constant_values=0)
psf = nfftshift(nfft2(nifftshift(padded)))
intensity_psf = abs(psf)**2
> 7.29 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
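For quick reproduction outside IPython, here is a self-contained version of the NumPy cell with the fft helpers imported explicitly:

```python
# Self-contained NumPy version of the benchmark body above.
import numpy as np
from numpy.fft import fft2, fftshift, ifftshift

precision = 'float32'
ary_size = 128
pad = ary_size // 2

# Cartesian and polar grids over [-1, 1] x [-1, 1]
x = np.linspace(-1, 1, ary_size, dtype=precision)
y = np.linspace(-1, 1, ary_size, dtype=precision)
xx, yy = np.meshgrid(x, y)
rho, phi = np.sqrt(xx**2 + yy**2), np.arctan2(yy, xx)

# phase error over the pupil; zero the field outside the unit circle
phase_err = rho**2 * np.cos(phi)
mask = rho > 1
wv_ary = np.exp(1j * 2 * np.pi * phase_err)
wv_ary[mask] = 0

# zero-pad to 2x, then FFT to get the point spread function
padded = np.pad(wv_ary, ((pad, pad), (pad, pad)),
                mode='constant', constant_values=0)
psf = fftshift(fft2(ifftshift(padded)))
intensity_psf = abs(psf)**2
```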

So I only get about a 35% reduction in runtime with CUDA. I know my GPU is not particularly beefy and its fp64 performance is much worse than its fp32 performance, but repeating the run in float32 gives no measurable speedup.

I also know that if I increase the array size to something much larger, e.g. 512, CUDA showcases GPU performance much better, with times of 8.19 ms for the GPU and 144 ms for the CPU, respectively.

So it seems this GPU-CPU coordination latency is what kills me at small array sizes. Is this a quirk of my laptop? It is surprisingly difficult to find information on CPU-GPU latency, but some reports I have seen put PCIe latency below 1 µs. If that were the case, my CUDA code would run around 20x faster and be much more usable.

Upvotes: 0

Views: 699

Answers (1)

Florent DUGUET

Reputation: 2916

It seems that all your operations are memory bound, except probably exp and atan2 in double precision on your GPU. The memory bandwidth of your GPU is 112 GB/s according to the GeForce website, while your CPU has around 37 GB/s according to ark.intel.com. That's roughly a 3x advantage for the GPU.
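That reasoning can be checked with back-of-envelope arithmetic, using the quoted bandwidth figures and assuming one read plus one write of a single 128x128 complex128 array per elementwise pass:

```python
# Rough bandwidth-bound time estimate for one elementwise pass over
# a 128x128 complex128 array, at the bandwidths quoted above.
ary_bytes = 128 * 128 * 16   # complex128 = 16 bytes per element
traffic = 2 * ary_bytes      # one read + one write (assumed)

gpu_us = traffic / 112e9 * 1e6   # GTX 1050: ~112 GB/s
cpu_us = traffic / 37e9 * 1e6    # i7-7700HQ: ~37 GB/s

print(f"GPU: {gpu_us:.2f} us  CPU: {cpu_us:.2f} us per pass")
```

A single pass comes out at a few microseconds, the same order of magnitude as the ~20 µs launch latency measured in the question, which is consistent with launch overhead dominating at this array size.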

Note that the small dataset does fit in the CPU's L2 cache, so you can assume that a read following a write hits cache (an order of magnitude faster than DRAM). That could contribute another 2x.

Finally, when launching such an operation on the GPU, the problem is not large enough for the GPU to hide latency, so you don't get the full bandwidth: the cost of a read is governed by its latency rather than by throughput. If you populate only half the read bus, you obtain half the bandwidth.

All of this can be verified by profiling your code with nvprof. You would then see the timing and latency of the individual kernels.

Upvotes: 1
