Blademaster

Reputation: 336

How to apply convolution in-place on GPU (in Python)

I need to perform a convolution on an image in-place, and by in-place I mean such that as the structuring element gets applied to different pixels, I want the result of the previous steps to overwrite the image. To put it in context, this is useful in Gauss-Seidel iteration.

I'm currently using scipy.ndimage.convolve1d, which clearly doesn't do in-place convolution as I described. I know how to write a kernel that does this with numba (basically a for-loop that overwrites the existing elements as it iterates over pixels), but I was wondering if I could get further speedups on a GPU. The issue is that since the array being operated on changes at each step, it's not trivial to implement on a GPU because of race conditions.

Here's a concrete example:

a = [1, 5, 3, 0, -4, 1]
weights = [1, 2, 1]

Here's what scipy.ndimage.convolve1d does (assume out is the result, also assume 0 for values extending the boundaries):

# Step 1: 1*0 + 2*1 + 1*5 = 7  -> out[0], a = [1, 5, 3, 0, -4, 1]
# Step 2: 1*1 + 2*5 + 1*3 = 14 -> out[1], a = [1, 5, 3, 0, -4, 1]
# Step 3: 1*5 + 2*3 + 1*0 = 12 -> out[2], a = [1, 5, 3, 0, -4, 1]
# ...

Here's what I want:

# Step 1: 1*0 + 2*1 + 1*5 = 7   -> a[0], a = [7, 5 , 3 , 0, -4, 1]
# Step 2: 1*7 + 2*5 + 1*3 = 20  -> a[1], a = [7, 20, 3 , 0, -4, 1]
# Step 3: 1*20 + 2*3 + 1*0 = 26 -> a[2], a = [7, 20, 26, 0, -4, 1]
# ...
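The steps above can be sketched as a plain CPU loop (`inplace_conv1d` is an illustrative name, not an existing API; zero padding is assumed beyond the boundaries):

```python
import numpy as np

def inplace_conv1d(a, weights):
    """In-place 1D convolution: each result overwrites a[i] before the
    window moves on, so later steps see the already-updated values."""
    a = np.asarray(a, dtype=float)
    w = np.asarray(weights, dtype=float)
    r = len(w) // 2  # half-width of the window
    for i in range(len(a)):
        s = 0.0
        for k in range(len(w)):
            j = i + k - r
            if 0 <= j < len(a):  # zero padding outside the boundaries
                s += w[k] * a[j]
        a[i] = s  # overwrite immediately
    return a

out = inplace_conv1d([1, 5, 3, 0, -4, 1], [1, 2, 1])
# out is [7, 20, 26, 22, 15, 17], matching the steps above
```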

Here's my numba implementation for a 2D convolution with the following weights:

import numba

w = [[0,  1, 0],
     [1, -4, 1],
     [0,  1, 0]]

@numba.njit()
def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]

Upvotes: 0

Views: 515

Answers (1)

Bob

Reputation: 14654

If I understood correctly, you are trying to run a 2D infinite impulse response (IIR) filter.

If your filter is separable, you can do that by applying scipy.signal.lfilter once per axis. Note that this method shifts the output, so it may be preferable to use filtfilt instead.
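A sketch of the per-axis approach, assuming a symmetric separable factor such as [1, 2, 1] (the array names here are illustrative):

```python
import numpy as np
from scipy.signal import lfilter, filtfilt

x = np.random.rand(16, 16)
w = np.array([1.0, 2.0, 1.0])  # one 1D factor of a separable kernel

# Causal FIR pass along each axis; the output lags the centered
# convolution by (len(w) - 1) // 2 samples per axis.
y = lfilter(w, [1.0], x, axis=0)
y = lfilter(w, [1.0], y, axis=1)

# filtfilt filters forward and backward, cancelling the phase shift
# (at the cost of applying the filter twice).
z = filtfilt(w, [1.0], x, axis=0)
z = filtfilt(w, [1.0], z, axis=1)
```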

Edit

The example given by the OP is not separable to begin with; however, it is simple enough that the computation can be done efficiently using a cumulative sum.

This is the original implementation:

def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j])

Notice that in the j loop, the only already-modified value in the sum is u[i, j-1], which is added to u[i, j]. This addition can be deferred and performed in a second loop without changing the output:

def myconv_worse(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j+1] - 4*u[i, j])
        for j in range(1, Ny-1):
            u[i, j] += u[i,j-1]

Then notice that the second loop can be computed as a cumulative sum:

def myconv_vec(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        u[i, 1:-1] = u[i-1, 1:-1] + u[i+1, 1:-1] + u[i, 2:] - 4*u[i, 1:-1]
        u[i, :-1] = u[i, :-1].cumsum(0)  # explicit axis so torch tensors (which require dim) work too

This vectorizes one axis.
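As a quick CPU sanity check that the vectorized version matches the original loop (a self-contained sketch using NumPy, redefining both functions from above):

```python
import numpy as np

def myconv(u):
    # original scalar loop: u[i, j-1] is already updated when read
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]

def myconv_vec(u):
    # vectorized row update followed by a cumulative sum
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        u[i, 1:-1] = u[i-1, 1:-1] + u[i+1, 1:-1] + u[i, 2:] - 4*u[i, 1:-1]
        u[i, :-1] = u[i, :-1].cumsum(0)

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 10))
b = a.copy()
myconv(a)
myconv_vec(b)
print(np.allclose(a, b))  # True (up to floating-point rounding)
```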

This is fully compatible with array libraries that support the GPU out of the box, e.g. PyTorch:

import torch

u = torch.randn(10, 10, device='cuda')
t1 = u.clone()
t2 = u.clone()
myconv_vec(t1)
myconv(t2)
print(t1 - t2)

Upvotes: 1
