Reputation: 21
I am trying to write code that convolves a 3D image with a 3D wavelet kernel that can be described using three independent parameters. I want to analyze the results of the convolution for all combinations of the three parameters used to generate the wavelet.
i.e. create a kernel for a given set of parameters (loop over i, j, k) and convolve it with the image.
Unfortunately, the first iteration runs in 0.1 s but the next iteration takes 600 s!
I think the issue essentially boils down to me using CuPy in a nested for-loop, but I really don't know how to avoid the slow-down I am getting. Maybe it is something to do with the memory queue? My real images are quite large (800x800x200), which is why I am trying to process them in batches instead of all in parallel.
Here is a snippet of the relevant code in case it is useful for diagnosing. The function skern3 uses exclusively CuPy arrays and returns a CuPy array.
# External Packages
import numpy as np
import cupy as cp
from cupyx.scipy.ndimage import convolve as convolve_gpu
# Custom kernel
def skern3(x, y, z, a, theta, phi, xi):
    # stretched mexican hat wavelet
    x = x - cp.max(x)/2
    y = y - cp.max(y)/2
    z = z - cp.max(z)/2
    rT = x*cp.cos(phi)*cp.cos(theta) - y*cp.sin(theta) + z*cp.sin(phi)*cp.cos(theta)
    rp = x*cp.cos(phi)*cp.sin(theta) + y*cp.cos(theta) + z*cp.sin(phi)*cp.sin(theta)
    rp2 = -1*x*cp.sin(phi) + z*cp.cos(phi)
    A = (3 - (rp/a)**2 - (1/xi**2)*(rT/a)**2 - (rp2/a)**2)
    B = cp.exp((-1/2)*((rp/a)**2 + (1/xi**2)*(rT/a)**2 + (rp2/a)**2))
    model = A*B
    return model
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.set_limit(10*1024**3)
theta_list = np.arange(0, np.pi, np.pi/100)
phi_list = theta_list
xi_list = np.arange(0,20,1)
a=1
energy_const = (1/np.sqrt(xi_list**2+2*a**2))
fib_sum_log = np.random.rand(200, 200, 200)
shape_im = np.array(fib_sum_log.shape)
w = shape_im[0]
h = shape_im[1]
l = shape_im[2]
x = np.uint16(np.linspace(0, w, w))
y = np.uint16(np.linspace(0, h, h))
z = np.uint16(np.linspace(0, l, l))
x, y, z = np.meshgrid(x, y, z)
x = cp.asarray(x.ravel())
y = cp.asarray(y.ravel())
z = cp.asarray(z.ravel())
xi_list = cp.asarray(xi_list)
theta_list = cp.asarray(theta_list)
phi_list = cp.asarray(phi_list)
a = cp.asarray(a)
fib_sum_log = cp.asarray(fib_sum_log.ravel())
energy_const = cp.asarray(energy_const)
# Pre-calculate wavelet energies (if needed)
wavelet_ener = cp.zeros(xi_list.size)
for k in range(xi_list.size):
    xi = xi_list[k]
    wavelet = skern3(x, y, z, a, theta_list[0], phi_list[0], xi)  # Use arbitrary theta/phi for initial calc
    wavelet_ener[k] = cp.sum(cp.abs(wavelet)**2)
# Pre-allocate output arrays on the GPU
trans_ener = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
var_image = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
max_image = cp.zeros((theta_list.size, phi_list.size, xi_list.size))
pinned_mempool.free_all_blocks()
for i in range(theta_list.size):
    theta = theta_list[i]
    for j in range(phi_list.size):
        phi = phi_list[j]
        # Calculate wavelet *ONCE* for this theta and phi
        for k in range(xi_list.size):
            xi = xi_list[k]
            wavelet = skern3(x, y, z, a, theta, phi, xi)
            # No more freeing memory inside the loop!
            G_Ixy = convolve_gpu(fib_sum_log, wavelet)  # Make sure fib_sum_log is a CuPy array
            pinned_mempool.free_all_blocks()
            print('Post Convolution')
            print('mempool used bytes: ' + str(mempool.used_bytes()))
            print('mempool total bytes: ' + str(mempool.get_limit()))
            print('pinned mempool free blocks: ' + str(pinned_mempool.n_free_blocks()))
            print('=======================================================================\n')
            images_wt = cp.multiply(energy_const[k], G_Ixy)  # Make sure energy_const is a CuPy array
            trans_ener[i, j, k] = cp.sum(cp.abs(G_Ixy)**2)
            var_image[i, j, k] = cp.sqrt(cp.var(G_Ixy))
            max_image[i, j, k] = cp.max(G_Ixy)
I tried to make everything inside of the for loops a CuPy array to avoid transfer overhead.
I tried an example without the for-loop, and it works... But I need to iterate, so I am not sure what I should do.
Something worth noting: no matter how careful I am about only using CuPy variables, pinned_mempool.n_free_blocks() always returns 3-5 free blocks in pinned memory. I think this indicates overhead is occurring, but I don't know why.
Upvotes: 2
Views: 53
Reputation: 50836
There are many issues in this code.
If you add synchronization instructions like cp.cuda.runtime.deviceSynchronize(), you can see that convolve_gpu is the instruction that takes all the time. Without synchronization, CuPy keeps executing the Python code while work is still queued on the GPU: CuPy launches kernels asynchronously for the sake of performance. This is likely an issue since you manually manage memory (with pinned_mempool.free_all_blocks()) without any synchronization beforehand.
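For example, a minimal timing sketch (reusing the variables from the question): synchronize before reading the clock, so the measurement includes the GPU work rather than just the asynchronous launch.

import time

start = time.perf_counter()
G_Ixy = convolve_gpu(fib_sum_log, wavelet)   # returns almost immediately (asynchronous launch)
cp.cuda.runtime.deviceSynchronize()          # wait for the GPU to actually finish
print(f'convolve_gpu took {time.perf_counter() - start:.3f} s')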
If we profile the code, we can see that skern3 launches a LOT of kernels: ~67 kernels are run per call and most of them take less than 1 ms. This is pretty bad. In fact, so many kernels are run (>1000 before even entering the second main loop) that the Nvidia profiler just crashed, and I never succeeded in reaching convolve_gpu in the profiler without modifying the code. On top of that, some kernels recompute the same things! Indeed, cp.cos(phi) is computed 3 times, as are cp.cos(theta), cp.sin(theta) and cp.sin(phi). The same applies to several other expressions like (rp/a)**2 or even 1/xi**2, which are computed twice. CuPy cannot optimize this away: that would require memoizing all operations, or optimizing the execution graph at runtime, which is pretty expensive (Python code is interpreted, so no optimization can be done before actually running it). It is your job not to write inefficient code like this.
On top of that, CuPy surprisingly does not optimize expressions like xi**2: it computes a very expensive generic pow(xi, 2) rather than the very cheap xi * xi multiplication... The latter is about 30 times faster on my Nvidia 1660S GPU!
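For instance, here is a sketch of skern3 with the redundant work factored out and squares computed by multiplication (same math as the original, but fewer kernels):

def skern3(x, y, z, a, theta, phi, xi):
    # stretched mexican hat wavelet, with each trig term computed once
    ct, st = cp.cos(theta), cp.sin(theta)
    cph, sph = cp.cos(phi), cp.sin(phi)
    x = x - cp.max(x)/2
    y = y - cp.max(y)/2
    z = z - cp.max(z)/2
    rT = (x*cph*ct - y*st + z*sph*ct) / (a*xi)   # folds the 1/xi**2 factor in
    rp = (x*cph*st + y*ct + z*sph*st) / a
    rp2 = (-x*sph + z*cph) / a
    q = rp*rp + rT*rT + rp2*rp2                  # squares via multiplication, not pow
    return (3 - q) * cp.exp(-q/2)                # A*B written as a single expression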
While you can mitigate the performance issue by avoiding re-computation and squaring by multiplication instead of calling pow, the number of kernels is still significant. This is a problem since each kernel reads/writes data from/to global memory, while the intermediate results could stay in registers, which are MUCH more efficient. The profiler confirms this: most of the useful kernels are actually memory-bound. Thus, I strongly advise you to write a single unified kernel doing all of this at once. It will also ease profiling.
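One way to do that from Python is cupy.fuse, which merges the elementwise part into one kernel. A minimal sketch, assuming theta, phi and xi are passed as plain Python floats and the centered coordinate arrays xc, yc, zc are computed once outside the loops (a cp.ElementwiseKernel or a raw CUDA kernel would give even more control):

import math

@cp.fuse()
def mexican_hat_fused(xc, yc, zc, inv_a, inv_axi, ct, st, cph, sph):
    # rotated coordinates, pre-divided by a (and by a*xi for rT)
    rT = (xc*cph*ct - yc*st + zc*sph*ct) * inv_axi
    rp = (xc*cph*st + yc*ct + zc*sph*st) * inv_a
    rp2 = (-xc*sph + zc*cph) * inv_a
    q = rp*rp + rT*rT + rp2*rp2
    return (3 - q) * cp.exp(-q / 2)

# Inside the loops: trigonometry is done once on the CPU, the rest in one kernel.
ct, st = math.cos(theta), math.sin(theta)
cph, sph = math.cos(phi), math.sin(phi)
wavelet = mexican_hat_fused(xc, yc, zc, 1.0/a, 1.0/(a*xi), ct, st, cph, sph)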
Regarding convolve_gpu, the two arrays have a size of 8_000_000, which is pretty huge. The algorithmic complexity of a naive convolution is O(n**2). This is not reasonable, even on a GPU (8_000_000**2 = 64_000_000_000_000 operations). The key to making this faster is to use an FFT-based implementation, and I do not think convolve_gpu uses this approach (the developers probably wrongly assumed that the kernel would be small, without mentioning it in the documentation). There is a catch though: FFT-based convolutions can apparently be significantly less precise than the naive algorithm. That being said, using FFTs is certainly mandatory given the size of the input here. Thus, I advise you to use an alternative convolution implementation that internally performs FFTs, and to check that the input has a reasonably low standard deviation so that the results are precise enough. I also advise you to manually check the results of this convolution.
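For instance, recent CuPy versions ship cupyx.scipy.signal.fftconvolve, which mirrors the SciPy API. A sketch, assuming the arrays are kept in their original 3D shape rather than raveled:

from cupyx.scipy.signal import fftconvolve

image = fib_sum_log.reshape(200, 200, 200)      # keep the image 3D
kernel = wavelet.reshape(200, 200, 200)
G_Ixy = fftconvolve(image, kernel, mode='same')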
If you do not find such an implementation, note that you can write it yourself since CuPy supports cuFFT. In fact, you can optimize the computation further with this approach. Indeed, fib_sum_log is apparently constant in the loop, so you could compute its FFT only once. I expect this to be 30% to 50% faster.
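A minimal sketch of that idea (note this computes a circular convolution; pad both arrays if you need a linear one, and apply cp.fft.fftshift to the kernel if it is centered):

image = fib_sum_log.reshape(200, 200, 200)
image_fft = cp.fft.rfftn(image)                  # computed once, before the loops

# inside the loops:
kernel_fft = cp.fft.rfftn(wavelet.reshape(200, 200, 200))
G_Ixy = cp.fft.irfftn(image_fft * kernel_fft, s=image.shape)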
Additionally, note that wavelet seems full of NaN values (likely because xi_list starts at 0, so 1/xi**2 divides by zero on the first iteration), which indicates that the result will certainly be garbage data and that the input is certainly already invalid (maybe just in this synthetic example)...
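A quick way to catch this early:

# Fail fast if the kernel already contains NaNs (e.g. when xi == 0):
if cp.isnan(wavelet).any():
    raise ValueError('wavelet contains NaN values; check the parameters (xi != 0?)')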
Considering all the aforementioned issues, I think it might be better for you to first write a fast (parallel) CPU implementation using efficient algorithms and Numba/Cython, and only then offload the computation to the GPU if the CPU is not fast enough. On the CPU, the FFT-based convolution should be the bottleneck.
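A minimal CPU sketch of the convolution step under the same assumptions, using SciPy's FFT-based implementation:

import numpy as np
from scipy.signal import fftconvolve

image = np.random.rand(200, 200, 200)
kernel = np.random.rand(200, 200, 200)    # stand-in for the wavelet
result = fftconvolve(image, kernel, mode='same')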
Upvotes: 2