Reputation: 21
I am trying to write code that convolves a 3D image with a 3D wavelet kernel that can be described using three independent parameters. I want to analyze the results of the convolution for all combinations of the three parameters used to generate the wavelet.
i.e. create a kernel for a given set of parameters (loop over i, j, k) and convolve it with the image.
Unfortunately, the first iteration runs in 0.1 s but the next iteration takes 600 s!
I think the issue essentially boils down to me using CuPy in a nested for-loop, but I really don't know how to avoid the slow-down I am getting. Maybe it is something to do with the memory queue? My real images are quite large (800x800x200), which is why I am trying to process them in batches instead of all in parallel.
Here is a snippet of the relevant code in case it is useful for diagnosing. The function skern3 uses exclusively CuPy arrays and returns a CuPy array.
# External Packages
import numpy as np
import cupy as cp
from cupyx.scipy.ndimage import convolve as convolve_gpu
# Custom kernel
def skern3(x, y, z, a, theta, phi, xi):
    # stretched mexican hat wavelet
    x = x - cp.max(x)/2
    y = y - cp.max(y)/2
    z = z - cp.max(z)/2
    rT = x*cp.cos(phi)*cp.cos(theta) - y*cp.sin(theta) + z*cp.sin(phi)*cp.cos(theta)
    rp = x*cp.cos(phi)*cp.sin(theta) + y*cp.cos(theta) + z*cp.sin(phi)*cp.sin(theta)
    rp2 = -1*x*cp.sin(phi) + z*cp.cos(phi)
    A = (3 - (rp/a)**2 - (1/xi**2)*(rT/a)**2 - (rp2/a)**2)
    B = cp.exp((-1/2)*((rp/a)**2 + (1/xi**2)*(rT/a)**2 + (rp2/a)**2))
    model = A*B
    return model
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.set_limit(10*1024**3)
theta_list = np.arange(0, np.pi, np.pi/100)
phi_list = theta_list
xi_list = np.arange(0,20,1)
a=1
energy_const = (1/np.sqrt(xi_list**2+2*a**2))
fib_sum_log = np.random.rand(200, 200, 200)
shape_im = np.array(fib_sum_log.shape)
w = shape_im[0]
h = shape_im[1]
l = shape_im[2]
x = np.uint16(np.linspace(0, w, w))
y = np.uint16(np.linspace(0, h, h))
z = np.uint16(np.linspace(0, l, l))
x, y, z = np.meshgrid(x, y, z)
x = cp.asarray(x.ravel())
y = cp.asarray(y.ravel())
z = cp.asarray(z.ravel())
xi_list = cp.asarray(xi_list)
theta_list = cp.asarray(theta_list)
phi_list = cp.asarray(phi_list)
a = cp.asarray(a)
fib_sum_log = cp.asarray(fib_sum_log.ravel())
energy_const = cp.asarray(energy_const)
# Pre-calculate wavelet energies (if needed)
wavelet_ener = cp.zeros(xi_list.size)
for k in range(xi_list.size):
    xi = xi_list[k]
    wavelet = skern3(x, y, z, a, theta_list[0], phi_list[0], xi)  # Use arbitrary theta/phi for initial calc
    wavelet_ener[k] = cp.sum(cp.abs(wavelet)**2)
# Pre-allocate output arrays on the GPU
trans_ener = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
var_image = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
max_image = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
pinned_mempool.free_all_blocks()
for i in range(theta_list.size):
    theta = theta_list[i]
    for j in range(phi_list.size):
        phi = phi_list[j]
        # Calculate wavelet *ONCE* for this theta and phi
        for k in range(xi_list.size):
            xi = xi_list[k]
            wavelet = skern3(x, y, z, a, theta, phi, xi)
            # No more freeing memory inside the loop!
            G_Ixy = convolve_gpu(fib_sum_log, wavelet)  # Make sure fib_sum_log is a CuPy array
            pinned_mempool.free_all_blocks()
            print('Post Convolution')
            print('mempool used bytes: ' + str(mempool.used_bytes()))
            print('mempool total bytes: ' + str(mempool.get_limit()))
            print('pinned mempool free blocks: ' + str(pinned_mempool.n_free_blocks()))
            print('=======================================================================\n')
            images_wt = cp.multiply(energy_const[k], G_Ixy)  # Make sure energy_const is a CuPy array
            trans_ener[i, j, k] = cp.sum(cp.abs(G_Ixy)**2)
            var_image[i, j, k] = cp.sqrt(cp.var(G_Ixy))
            max_image[i, j, k] = cp.max(G_Ixy)
I tried to make everything inside of the for loops a CuPy array to avoid transfer overhead.
I tried an example without the for-loop, and it works... But I need to iterate, so I am not sure what I should do.
Something worth noting: no matter how careful I am about only using CuPy variables, pinned_mempool.n_free_blocks() always returns 3-5 free blocks in pinned memory. I think this indicates overhead is occurring, but I don't know why.
Upvotes: 2
Views: 53
Reputation: 50836
There are many issues in this code.
If you add synchronization instructions like cp.cuda.runtime.deviceSynchronize(), you can see that convolve_gpu is the instruction that takes all the time. Without synchronization, CuPy keeps executing the Python code while work is still queued on the GPU: CuPy launches kernels asynchronously for the sake of performance. This is likely an issue since you manually manage memory (with pinned_mempool.free_all_blocks()) without any synchronization beforehand.
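For example, a minimal timing sketch (reusing the variables from the question): synchronize before reading the clock, so the measurement includes the GPU work rather than just the asynchronous launch.

import time

start = time.perf_counter()
G_Ixy = convolve_gpu(fib_sum_log, wavelet)   # returns almost immediately (asynchronous launch)
cp.cuda.runtime.deviceSynchronize()          # wait for the GPU to actually finish
print(f'convolve_gpu took {time.perf_counter() - start:.3f} s')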
If we profile the code, we can see that skern3 launches a LOT of kernels: ~67 kernels are run per call and most of them take less than 1 ms. This is pretty bad. In fact, so many kernels are run (>1000 before even entering the second main loop) that the Nvidia profiler just crashed, and I never succeeded in reaching convolve_gpu in the profiler without modifying the code. On top of that, some kernels recompute the same things! Indeed, cp.cos(phi) is computed 3 times, as are cp.cos(theta), cp.sin(theta) and cp.sin(phi). The same applies to several other expressions like (rp/a)**2 or even 1/xi**2, which are computed twice. CuPy cannot optimize this away: that would require memoizing all operations, or optimizing the execution graph at runtime, which is pretty expensive (Python code is interpreted, so no optimization can be done before actually running it). It is your job not to write inefficient code like this.
On top of that, CuPy surprisingly does not optimize expressions like xi**2: it computes a very expensive generic pow(xi, 2) rather than the very cheap xi * xi multiplication... The latter is about 30 times faster on my Nvidia 1660S GPU!
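For instance, here is a sketch of skern3 with the redundant work factored out and squares computed by multiplication (same math as the original, but fewer kernels):

def skern3(x, y, z, a, theta, phi, xi):
    # stretched mexican hat wavelet, with each trig term computed once
    ct, st = cp.cos(theta), cp.sin(theta)
    cph, sph = cp.cos(phi), cp.sin(phi)
    x = x - cp.max(x)/2
    y = y - cp.max(y)/2
    z = z - cp.max(z)/2
    rT = (x*cph*ct - y*st + z*sph*ct) / (a*xi)   # folds the 1/xi**2 factor in
    rp = (x*cph*st + y*ct + z*sph*st) / a
    rp2 = (-x*sph + z*cph) / a
    q = rp*rp + rT*rT + rp2*rp2                  # squares via multiplication, not pow
    return (3 - q) * cp.exp(-q/2)                # A*B written as a single expression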
While you can mitigate the performance issue by avoiding re-computation and squaring by multiplication instead of calling pow, the number of kernels is still significant. This is a problem since each kernel reads/writes data from/to global memory, while the intermediate results could stay in registers, which are MUCH more efficient. The profiler confirms this: most of the useful kernels are actually memory-bound. Thus, I strongly advise you to write a single unified kernel doing all of this at once. It will also ease profiling.
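One way to do that from Python is cupy.fuse, which merges the elementwise part into one kernel. A minimal sketch, assuming theta, phi and xi are passed as plain Python floats and the centered coordinate arrays xc, yc, zc are computed once outside the loops (a cp.ElementwiseKernel or a raw CUDA kernel would give even more control):

import math

@cp.fuse()
def mexican_hat_fused(xc, yc, zc, inv_a, inv_axi, ct, st, cph, sph):
    # rotated coordinates, pre-divided by a (and by a*xi for rT)
    rT = (xc*cph*ct - yc*st + zc*sph*ct) * inv_axi
    rp = (xc*cph*st + yc*ct + zc*sph*st) * inv_a
    rp2 = (-xc*sph + zc*cph) * inv_a
    q = rp*rp + rT*rT + rp2*rp2
    return (3 - q) * cp.exp(-q / 2)

# Inside the loops: trigonometry is done once on the CPU, the rest in one kernel.
ct, st = math.cos(theta), math.sin(theta)
cph, sph = math.cos(phi), math.sin(phi)
wavelet = mexican_hat_fused(xc, yc, zc, 1.0/a, 1.0/(a*xi), ct, st, cph, sph)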
Regarding convolve_gpu, the two arrays have a size of 8_000_000, which is pretty huge. The algorithmic complexity of a naive convolution is O(n**2). This is not reasonable, even on a GPU (8_000_000**2 = 64_000_000_000_000 operations). The key to making this faster is to use an FFT-based implementation, and I do not think convolve_gpu uses this approach (the developers probably wrongly assumed that the kernel would be small, without mentioning it in the documentation). There is a catch though: FFT-based convolutions can apparently be significantly less precise than the naive algorithm. That being said, using FFTs is certainly mandatory given the size of the input here. Thus, I advise you to use an alternative convolution implementation that internally performs FFTs, and to check that the input has a reasonably low standard deviation so that the results are precise enough. I also advise you to manually check the results of this convolution.
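For instance, recent CuPy versions ship cupyx.scipy.signal.fftconvolve, which mirrors the SciPy API. A sketch, assuming the arrays are kept in their original 3D shape rather than raveled:

from cupyx.scipy.signal import fftconvolve

image = fib_sum_log.reshape(200, 200, 200)      # keep the image 3D
kernel = wavelet.reshape(200, 200, 200)
G_Ixy = fftconvolve(image, kernel, mode='same')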
If you do not find such an implementation, note that you can write it yourself since CuPy supports cuFFT. In fact, you can optimize the computation further with this approach. Indeed, fib_sum_log is apparently constant in the loop, so you could compute its FFT only once. I expect this to be 30% to 50% faster.
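A minimal sketch of that idea (note this computes a circular convolution; pad both arrays if you need a linear one, and apply cp.fft.fftshift to the kernel if it is centered):

image = fib_sum_log.reshape(200, 200, 200)
image_fft = cp.fft.rfftn(image)                  # computed once, before the loops

# inside the loops:
kernel_fft = cp.fft.rfftn(wavelet.reshape(200, 200, 200))
G_Ixy = cp.fft.irfftn(image_fft * kernel_fft, s=image.shape)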
Additionally, note that wavelet seems full of NaN values (likely because xi_list starts at 0, so 1/xi**2 divides by zero on the first iteration), which indicates that the result will certainly be garbage data and that the input is certainly already invalid (maybe just in this synthetic example)...
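A quick way to catch this early:

# Fail fast if the kernel already contains NaNs (e.g. when xi == 0):
if cp.isnan(wavelet).any():
    raise ValueError('wavelet contains NaN values; check the parameters (xi != 0?)')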
Considering all the aforementioned issues, I think it might be better for you to first write a fast (parallel) CPU implementation using efficient algorithms and Numba/Cython, and only then offload the computation to the GPU if the CPU is not fast enough. On the CPU, the FFT-based convolution should be the bottleneck.
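A minimal CPU sketch of the convolution step under the same assumptions, using SciPy's FFT-based implementation:

import numpy as np
from scipy.signal import fftconvolve

image = np.random.rand(200, 200, 200)
kernel = np.random.rand(200, 200, 200)    # stand-in for the wavelet
result = fftconvolve(image, kernel, mode='same')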
Upvotes: 2