Reputation: 73
I read the example in CuPy's documentation about how to use CuPy and Numba together, using CUDA to accelerate the code: https://docs-cupy.chainer.org/en/stable/reference/interoperability.html
I wrote similar code to test it:
import cupy
from numba import cuda
import numpy as np
import time
@cuda.jit
def add(x, y, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    # grid-stride loop: each thread handles elements start, start+stride, ...
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]
a = cupy.arange(10000000)
b = a * 2
out = cupy.zeros_like(a)
print("add function time consuming:")
s = time.time()
add(a, b, out)
e = time.time()
print(e-s)
s = time.time()
print("out[2]:")
print(out[2])
e = time.time()
print("the time of transfering out[2] out of GPU:")
print(e-s)
s = time.time()
new_OUT = a + b
print("new out[2] which only use cupy:")
print(new_OUT[2])
e = time.time()
print("the total time of running cupy addition and transfering new out[2] out of GPU:")
print(e-s)
The output is:
add function elapsed time:
0.0019025802612304688
out[2]:
6
the time of transferring out[2] out of GPU:
1.5608515739440918
new out[2] which only uses cupy:
6
the total time of running the cupy addition and transferring new out[2] out of GPU:
0.002993345260620117
How can the call to out[2] be so slow in the first case?
I am writing some functions that need to work with CuPy arrays and matrices. The functions work fine, but afterwards, when I need to make some modifications, even a call like out.shape
is super slow (my matrices and arrays are very large).
I am not sure what is going on here. Since CuPy also uses CUDA, a + b
should also run on the GPU, and yet reading new_OUT[2]
afterwards to check its value takes almost no time, while reading out[2] in the first case is extremely expensive.
Upvotes: 0
Views: 708
Reputation: 151799
There are at least two things to be aware of in order to understand your code's output:
1. In CUDA, a kernel launch is usually configured to indicate the grid configuration (number of blocks, number of threads per block). A numba CUDA kernel launch will normally present the grid configuration in square brackets immediately prior to the kernel arguments:
kernel_name[grid_configuration](kernel_arguments)
In numba CUDA, it is syntactically permissible to omit the square brackets and the grid configuration, which has the implicit meaning of a grid configuration of [1,1]
(one block, consisting of one thread). Your kernel works with a more-or-less arbitrary grid configuration, because it employs a grid-stride loop. However, this does not mean that the grid configuration doesn't matter. It does matter for performance. A grid configuration of [1,1]
will give dismal performance and should never be used in a CUDA kernel launch where performance matters. Thus we can rectify this with a change to your kernel invocation such as:
add[1024,256](a, b, out)
which will launch a grid of 1024 blocks, each with 256 threads.
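As a sketch, one common pattern is to derive the block count from the array size instead of hard-coding it (reusing a, b, and out from your code):
threads_per_block = 256
# ceiling division to cover all elements, capped at 1024 blocks
blocks = min(1024, (a.size + threads_per_block - 1) // threads_per_block)
add[blocks, threads_per_block](a, b, out)
The grid-stride loop in the kernel then covers any leftover elements, so capping the block count is safe.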
2. In CUDA, kernel launches are asynchronous. This means the host code launching the kernel will initiate the launch but will not wait for the kernel to complete. The same applies to numba CUDA kernel launches. Therefore, timing measurements of the kernel launch itself will generally be confusing. For timing purposes, this can be adjusted by forcing the CPU thread to wait inside the timing region until the kernel is complete. In numba CUDA we can accomplish this with:
cuda.synchronize()
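Putting both points together, a minimal sketch of a corrected timing measurement, using the variables from your code:
s = time.time()
add[1024,256](a, b, out)  # explicit grid configuration: 1024 blocks of 256 threads
cuda.synchronize()        # wait for the kernel to actually finish
e = time.time()
print(e-s)                # now measures kernel execution, not just the launch
Measured this way, the first timing reflects the kernel's actual run time. This also explains your observation: with the implicit [1,1] launch, the later read of out[2] had to wait for the slow kernel to complete, which is why it appeared to take 1.5 seconds.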
Upvotes: 2