Reputation: 27
I would like to implement a variant of convolution in PyCUDA.
For simplicity, I'll show a rectangular (box) kernel. The standard convolution can be applied as follows:
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <stdio.h>

// Horizontal box filter: each thread computes one output pixel.
// Assumes the launch covers exactly img_size x img_size threads and that
// dest is zero-initialised, because the kernel accumulates into it.
__global__ void func(float *dest, float *a)
{
    const int img_size = 64;
    const int kernel_size = 3;
    const int kernel_size_half = kernel_size / 2;
    const int tx = blockIdx.x * blockDim.x + threadIdx.x;
    const int ty = blockIdx.y * blockDim.y + threadIdx.y;

    for (int idx = -kernel_size_half; idx <= kernel_size_half; idx++)
    {
        const int tx_kernel = tx + idx;
        // Skip neighbours that fall outside the image row.
        if ((tx_kernel < 0) || (tx_kernel > img_size - 1))
            continue;
        dest[ty * img_size + tx] += a[ty * img_size + tx_kernel] / ((float) kernel_size);
    }
}
""")
Instead of computing each output pixel from its neighbours (a gather), I would like to do the opposite: add the value of the current pixel to each of its neighbours (a scatter).
That is, I want to change the line:
dest[ty * img_size + tx] += a[ty * img_size + tx_kernel] / ((float) kernel_size);
to:
dest[ty * img_size + tx_kernel] += a[ty * img_size + tx] / ((float) kernel_size);
However, the first version works fine but the second does not: it produces wrong results when updating the neighbours. Is there a way around this?
Note: I simplified the question to focus on what I need. The general problem is to use a different convolution kernel for each pixel, instead of the same kernel for every pixel as in the code above.
Upvotes: 1
Views: 88
Reputation: 152123
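The second version fails because of a race condition: each dest element is a neighbour of up to kernel_size threads, and an unsynchronized += from several threads on the same element lets one update overwrite another. The first method is preferred from a performance perspective, since each thread writes only its own output element. However, if you wish to "update the neighbors", it should be possible to recast the second operation as an atomic add: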
atomicAdd(&(dest[ty * img_size + tx_kernel]), a[ty * img_size + tx] / ((float) kernel_size));
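To see on the CPU why accumulating every contribution matters, here is a minimal NumPy sketch (not GPU code, and not part of the original posts). It builds the same (pixel, neighbour) index pairs as the kernel loop and compares the gather form, a scatter that accumulates all contributions, and a scatter with a plain "+=". The names index_pairs, gather, scatter and lost are made up for illustration, and np.add.at is only a CPU stand-in for what atomicAdd does on the device: NumPy drops repeated-index updates deterministically, whereas the GPU race is timing dependent.

```python
import numpy as np

IMG_SIZE = 64     # img_size in the kernel
KERNEL_SIZE = 3   # kernel_size in the kernel
HALF = KERNEL_SIZE // 2


def index_pairs():
    """Flat (pixel, neighbour) index pairs, one per iteration of the kernel loop."""
    ty, tx, idx = np.meshgrid(
        np.arange(IMG_SIZE),
        np.arange(IMG_SIZE),
        np.arange(-HALF, HALF + 1),
        indexing="ij",
    )
    tx_kernel = tx + idx
    valid = (tx_kernel >= 0) & (tx_kernel < IMG_SIZE)  # same bounds test as the kernel
    pixel = (ty * IMG_SIZE + tx)[valid]
    neighbour = (ty * IMG_SIZE + tx_kernel)[valid]
    return pixel, neighbour


rng = np.random.default_rng(0)
# Flat test image with strictly positive values (illustrative data only).
a = (rng.random(IMG_SIZE * IMG_SIZE) + 1.0).astype(np.float32)
pixel, neighbour = index_pairs()

# Gather: each pixel accumulates its neighbours' values (first variant in the question).
gather = np.zeros_like(a)
np.add.at(gather, pixel, a[neighbour] / np.float32(KERNEL_SIZE))

# Scatter that accumulates every contribution (the role atomicAdd plays on the GPU).
scatter = np.zeros_like(a)
np.add.at(scatter, neighbour, a[pixel] / np.float32(KERNEL_SIZE))

# Scatter with a plain "+=": NumPy buffers the update, so for repeated
# indices only one contribution per destination element survives.
lost = np.zeros_like(a)
lost[neighbour] += a[pixel] / np.float32(KERNEL_SIZE)

print("scatter matches gather:", np.allclose(scatter, gather))
print("plain += matches scatter:", np.allclose(lost, scatter))
```

For this symmetric box kernel the accumulating scatter should reproduce the gather result, while the plain "+=" version should not. With a different kernel per pixel (the general problem mentioned in the question) the two forms are no longer interchangeable, which is when the scatter form with atomicAdd becomes the natural choice despite its cost.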
Upvotes: 2