Reputation: 336
I need to perform a convolution on an image in-place, and by in-place I mean that as the structuring element gets applied to different pixels, I want the results of the previous steps to overwrite the image. To put it in context, this is useful in Gauss-Seidel iteration.
I'm currently using scipy.ndimage.convolve1d, which clearly doesn't do in-place convolution as I described. I know how to write a kernel that does that using numba (basically a for-loop where you overwrite the existing elements as you iterate over the pixels), but I was wondering if I could get further speedups on a GPU. The issue is that since the array being operated on changes at each step, it's not trivial to implement on a GPU because of race conditions.
Here's a concrete example:
a = [1, 5, 3, 0, -4, 1]
weights = [1, 2, 1]
Here's what scipy.ndimage.convolve1d does (assume out is the result, and assume 0 for values beyond the boundaries):
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> out[0], a = [1, 5, 3, 0, -4, 1]
# Step 2: 1*1 + 2*5 + 1*3 = 14 -> out[1], a = [1, 5, 3, 0, -4, 1]
# Step 3: 1*5 + 2*3 + 1*0 = 11 -> out[2], a = [1, 5, 3, 0, -4, 1]
# ...
Here's what I want:
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> a[0], a = [7, 5 , 3 , 0, -4, 1]
# Step 2: 1*7 + 2*5 + 1*3 = 20 -> a[1], a = [7, 20, 3 , 0, -4, 1]
# Step 3: 1*20 + 2*3 + 1*0 = 26 -> a[2], a = [7, 20, 26, 0, -4, 1]
# ...
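For reference, here's a minimal 1D sketch of that in-place update (a plain Python loop; the name inplace_conv1d and the explicit boundary handling are just illustrative, assuming a 3-tap kernel and zero padding):
import numpy as np

def inplace_conv1d(a, w):
    # Each result overwrites a[k] before the next step reads it,
    # so later steps see the already-updated values.
    n = len(a)
    for k in range(n):
        left = a[k-1] if k > 0 else 0.0
        right = a[k+1] if k < n-1 else 0.0
        a[k] = w[0]*left + w[1]*a[k] + w[2]*right

a = np.array([1., 5., 3., 0., -4., 1.])
inplace_conv1d(a, [1, 2, 1])
print(a[:3])  # 7, 20, 26 -- matching the steps above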
Here's my numba implementation for a 2D convolution with the following weights:
w = [[ 0  1  0],
     [ 1 -4  1],
     [ 0  1  0]]
import numba

@numba.njit()
def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            # u[i-1, j] and u[i, j-1] were already overwritten on earlier
            # iterations, so each update reads the freshest values.
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]
Upvotes: 0
Views: 515
Reputation: 14654
If I understood correctly, you are trying to run a 2D infinite impulse response (IIR) filter.
If your filter is separable you can do that by applying scipy.signal.lfilter once along each axis. Note that this method shifts the output, so it may be preferable to use scipy.signal.filtfilt.
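For the separable case, a minimal sketch of that approach (the coefficients b and a below are illustrative placeholders, not the OP's kernel):
import numpy as np
from scipy.signal import lfilter

u = np.random.randn(64, 64)
b, a = [1.0], [1.0, -0.5]      # placeholder first-order IIR coefficients
u = lfilter(b, a, u, axis=0)   # filter along the first axis
u = lfilter(b, a, u, axis=1)   # then along the second axis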
The example given by the OP is not separable to begin with, but it is simple enough that the computation can still be done efficiently using a cumulative sum.
This is the original implementation:
def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j])
Notice that in the j loop, the only already-modified value entering the sum is u[i, j-1]. Adding it to u[i, j] can be deferred and performed in a second loop without changing the output.
def myconv_worse(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        # First pass: everything except the deferred u[i, j-1] term.
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j+1] - 4*u[i, j])
        # Second pass: add the (already updated) previous column.
        for j in range(1, Ny-1):
            u[i, j] += u[i, j-1]
Then, notice that the second loop can be computed with a cumulative sum:
def myconv_vec(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        u[i, 1:-1] = u[i-1, 1:-1] + u[i+1, 1:-1] + u[i, 2:] - 4 * u[i, 1:-1]
        # The explicit axis keeps this working for both numpy arrays and
        # torch tensors (torch's cumsum requires the dim argument).
        u[i, :-1] = u[i, :-1].cumsum(0)
This vectorizes one axis.
This is fully compatible with PyTorch, which supports GPUs out of the box.
import torch

u = torch.randn(10, 10, device='cuda')
t1 = u.clone()
t2 = u.clone()
myconv_vec(t1)
myconv(t2)
print(t1 - t2)  # difference should be (numerically) zero
Upvotes: 1