Reputation: 10307
I was hoping someone could give me a hand here. I've been getting my feet wet in CUDA, and wrote a simple kernel to negate an image. It works brilliantly and I'm pretty happy with it.
I guess my rather stupid question is: is there any way I could optimize this kernel? I tried to use shared memory, but the image has 19224000 pixels.
I tried to just do __shared__ int sharedMem[19224000], which simply didn't run. I'm a little lost here, as a CUDA programmer could probably tell.
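For context, a rough size check shows why that declaration cannot work: 19224000 ints at 4 bytes each is about 77 MB, whereas shared memory is a small per-block resource, typically tens of kilobytes. The sketch below queries the limit on device 0 (an assumption) and prints it next to the requested size. `bytes_for_ints` is a hypothetical helper introduced for illustration, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes needed to hold `count` ints.
static size_t bytes_for_ints(size_t count)
{
    return count * sizeof(int);
}

int main()
{
    const size_t pixels = 19224000;  // pixel count from the question

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("requested shared array:  %zu bytes\n", bytes_for_ints(pixels));
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

Shared memory also only pays off when threads in a block reuse the same data. Here every pixel is read once and written once, so there is nothing to cache.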
Here is my kernel:
__global__ void cuda_negate_image(int * new_array, int * old_array, int rows, int cols){
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int n = rows * cols;
    if (i < n)
        new_array[i] = -(old_array[i]) + 255;
}
Any help would be awesome!
Upvotes: 2
Views: 234
Reputation: 72351
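There isn't much scope for optimisation here. For simple, memory-bound operations the four golden rules are usually:
1. Coalesce memory reads and writes
2. Maximise the bytes per memory transaction when using coalesced access
3. Use appropriate compiler hints so that the emitted code is optimal
4. Amortise thread scheduling and setup overhead by having each thread process multiple inputs, if feasible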
Apply those principles to your kernel and I get something like this:
// Scalar negation of a single pixel value.
__device__ __forceinline__ void negate(int &in, int &out)
{
    out = 255 - in;
}

// Vector overloads: one 8-byte or 16-byte load/store covers 2 or 4 pixels.
__device__ __forceinline__ void negate(int2 &in, int2 & out)
{
    negate(in.x, out.x);
    negate(in.y, out.y);
}

__device__ __forceinline__ void negate(int4 &in, int4 & out)
{
    negate(in.x, out.x);
    negate(in.y, out.y);
    negate(in.z, out.z);
    negate(in.w, out.w);
}

// n is the number of elements of type T, not the number of ints.
template<typename T>
__global__ void cuda_negate_image(T * __restrict__ new_array, T * __restrict__ old_array, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    T oldval, newval;
    // Grid-stride loop: each thread processes several elements.
    for(; i < n; i += stride) {
        oldval = old_array[i];
        negate(oldval, newval);
        new_array[i] = newval;
    }
}

// Explicit instantiations for the scalar and vector variants.
template __global__ void cuda_negate_image<int>(int * __restrict__ new_array, int * __restrict__ old_array, int n);
template __global__ void cuda_negate_image<int2>(int2 * __restrict__ new_array, int2 * __restrict__ old_array, int n);
template __global__ void cuda_negate_image<int4>(int4 * __restrict__ new_array, int4 * __restrict__ old_array, int n);
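The kernel above does not show the host side, so here is a minimal sketch of how the `int4` variant might be launched. The block size of 256, the cap of 1024 blocks, the dummy image contents, and the helper `grid_size` are all assumptions for illustration, not values taken from the answer. The kernel and `negate` overloads are repeated only so the example compiles on its own.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ __forceinline__ void negate(int &in, int &out) { out = 255 - in; }

__device__ __forceinline__ void negate(int4 &in, int4 &out)
{
    negate(in.x, out.x);
    negate(in.y, out.y);
    negate(in.z, out.z);
    negate(in.w, out.w);
}

template<typename T>
__global__ void cuda_negate_image(T * __restrict__ new_array, T * __restrict__ old_array, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    T oldval, newval;
    for (; i < n; i += stride) {
        oldval = old_array[i];
        negate(oldval, newval);
        new_array[i] = newval;
    }
}

// Hypothetical helper: enough blocks to cover n elements, capped at max_blocks.
// The grid-stride loop picks up whatever the capped grid does not cover.
static int grid_size(int n, int threads, int max_blocks)
{
    int blocks = (n + threads - 1) / threads;
    return blocks < max_blocks ? blocks : max_blocks;
}

int main()
{
    const int n = 19224000;               // pixel count from the question, a multiple of 4
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n, 10);         // dummy image: every pixel is 10 (assumption)

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    const int n4 = n / 4;                 // element count in units of int4
    const int threads = 256;              // assumed block size
    const int blocks = grid_size(n4, threads, 1024);  // assumed cap

    // cudaMalloc returns suitably aligned pointers, so the int4 cast is safe here.
    cuda_negate_image<int4><<<blocks, threads>>>(
        reinterpret_cast<int4 *>(d_out), reinterpret_cast<int4 *>(d_in), n4);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    printf("first pixel after negation: %d\n", host[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the pixel count were not a multiple of 4, the leftover one to three pixels would need a separate pass, for example a small launch of the `int` variant on the tail.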
Only benchmarking on your target hardware will tell you which version of the code is the fastest and whether this is even worth bothering with.
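One way such a comparison could be set up is with CUDA events, as in the sketch below. The launch configuration, the repetition count, and the helpers `time_variant` and `effective_gbps` are assumptions made for illustration. The kernel is repeated so the example is self-contained, and the numbers it prints will depend entirely on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ __forceinline__ void negate(int &in, int &out) { out = 255 - in; }

__device__ __forceinline__ void negate(int4 &in, int4 &out)
{
    negate(in.x, out.x);
    negate(in.y, out.y);
    negate(in.z, out.z);
    negate(in.w, out.w);
}

template<typename T>
__global__ void cuda_negate_image(T * __restrict__ new_array, T * __restrict__ old_array, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    T oldval, newval;
    for (; i < n; i += stride) {
        oldval = old_array[i];
        negate(oldval, newval);
        new_array[i] = newval;
    }
}

// Hypothetical helper: effective bandwidth in GB/s for `bytes` moved in `ms` milliseconds.
static double effective_gbps(size_t bytes, float ms)
{
    return (static_cast<double>(bytes) / 1e9) / (static_cast<double>(ms) / 1e3);
}

// Hypothetical helper: average milliseconds per launch of the variant for type T.
template<typename T>
static float time_variant(int *d_out, int *d_in, int n_ints, int reps)
{
    const int n = n_ints / static_cast<int>(sizeof(T) / sizeof(int));
    const int threads = 256, blocks = 1024;   // assumed starting point, worth tuning
    T *out = reinterpret_cast<T *>(d_out);
    T *in = reinterpret_cast<T *>(d_in);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cuda_negate_image<T><<<blocks, threads>>>(out, in, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cuda_negate_image<T><<<blocks, threads>>>(out, in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 19224000;                   // pixel count from the question
    const size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_in, 0, bytes);               // contents do not matter for timing

    const float ms1 = time_variant<int>(d_out, d_in, n, 20);
    const float ms4 = time_variant<int4>(d_out, d_in, n, 20);

    // Each launch reads and writes the whole image once: 2 * bytes of traffic.
    printf("int : %.3f ms, %.1f GB/s\n", ms1, effective_gbps(2 * bytes, ms1));
    printf("int4: %.3f ms, %.1f GB/s\n", ms4, effective_gbps(2 * bytes, ms4));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Comparing the reported bandwidth against the device's theoretical peak gives a rough idea of how much headroom is left. If the plain `int` version is already close, the vectorised variants are unlikely to be worth the extra complexity.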
Upvotes: 4