Reputation: 99
The root of the problem is most likely that I didn't allocate enough memory. I will try to think about this and do it correctly and then answer to my question. Silly me. :-[ It doesn't explain the warps not showing up in stdout though...
I created a templated kernel in CUDA in which I iterate over sections of grayscale image data in global memory (shared memory optimizations are due when I get this working) to achieve morphological operations with disc-shaped structure elements. Each thread corresponds to a pixel of the image. When the data type is char
, everything works as expected, all my threads do what they should. When I change it to unsigned short
, it starts acting up and only computes the upper half of my image. When I put in some printfs (my device has 2.0 CC), I found out that some of the warps that should run aren't even computed.
Here's the relevant code.
From my main.cpp I call gcuda::ErodeGpuGray8(img, radius);
and gcuda::ErodeGpuGray16(img, radius);
which are the following functions:
// gcuda.h
…
i3d::Image3d<i3d::GRAY8> ErodeGpuGray8(i3d::Image3d<i3d::GRAY8> img, const unsigned int radius);
i3d::Image3d<i3d::GRAY16> ErodeGpuGray16(i3d::Image3d<i3d::GRAY16> img, const unsigned int radius);
…
// gcuda.cu
…
// call this from outside
Image3d<GRAY8> ErodeGpuGray8(Image3d<GRAY8> img, const unsigned int radius) {
return ErodeGpu<GRAY8>(img, radius);
}
// call this from outside
Image3d<GRAY16> ErodeGpuGray16(Image3d<GRAY16> img, const unsigned int radius) {
return ErodeGpu<GRAY16>(img, radius);
}
…
The library I'm using defines GRAY8
as char
and GRAY16
as unsigned short
.
Here's how I call the kernel (blockSize
is a const int
set to 128 in the relevant namespace):
// gcuda.cu
template<typename T> Image3d<T> ErodeGpu(Image3d<T> img, const unsigned int radius) {
unsigned int width = img.GetWidth();
unsigned int height = img.GetHeight();
unsigned int w = nextHighestPower2(width);
unsigned int h = nextHighestPower2(height);
const size_t n = width * height;
const size_t N = w * h;
Image3d<T>* rslt = new Image3d<T>(img);
T *vx = rslt->GetFirstVoxelAddr();
// kernel parameters
dim3 dimBlock( blockSize );
dim3 dimGrid( ceil( N / (float)blockSize) );
// source voxel array on device (orig)
T *vx_d;
// result voxel array on device (for result of erosion)
T *vxr1_d;
// allocate memory on device
gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );
ErodeGpu<T><<<dimGrid, dimBlock>>>(vx_d, vxr1_d, n, width, radius);
gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );
// free device memory
gpuErrchk( cudaFree( vx_d ) );
gpuErrchk( cudaFree( vxr1_d ) );
// for debug purposes
rslt->SaveImage("../erodegpu.png");
return rslt;
}
The dimensions of my testing image are 82x82, so n = 82*82 = 6724 and N = 128*128 = 16384.
This is my kernel:
// gcuda.cu
// CUDA Kernel -- used for image erosion with a circular structure element of radius "erosionR"
template<typename T> __global__ void ErodeGpu(const T *in, T *out, const unsigned int n, const int width, const int erosionR)
{
ErodeOrDilateCore<T>(ERODE, in, out, n, width, erosionR);
}
// The core of erosion or dilation. Operation is determined by the first parameter
template<typename T> __device__ void ErodeOrDilateCore(operation_t operation, const T *in, T *out, const unsigned int n, const int width, const int radius) {
// get thread number, this method is overkill for my purposes but generally should be bulletproof, right?
int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
int tx = threadId;
if (tx >= n) {
printf("[%d > %d]", tx, n);
return;
} else {
printf("{%d}", tx);
}
… (erosion implementation, stdout is the same when this is commented out so it's probably not the root of the problem)
}
To my understanding, this code should write a randomly sorted set of [X > N]
and {X}
strings to stdout, where X = thread ID and there should be n
curly-bracketed numbers (i.e. the output of threads with the index < n
) and N - n
of the rest, but when I run it and count the curly-bracketed numbers using a regex, I find out that I only get 256 of them. Furthermore, they seem to occur in 32-member groups, which tells me that some warps are run and some are not.
I am really baffled by this. It doesn't help that when I don't comment out the erosion implementation part, the GRAY8 erosion works and the GRAY16 erosion doesn't, even though the stdout output is exactly the same in both cases (could be input-dependent, I only tried this with 2 images).
What am I missing? What could be the cause of this? Is there some memory-management mistake on my part or is it fine that some warps don't run and the erosion stuff is possibly just a bug in the image library that only occurs with the GRAY16 type?
Upvotes: 0
Views: 166
Reputation: 99
So this was just a stupid malloc mistake.
Instead of
const size_t n = width * height;
const size_t N = w * h;
I used
const int n = width * height;
const int N = w * h;
and instead of the erroneous
gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );
…
gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );
I used
gpuErrchk( cudaMalloc( (void**)&vx_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n * sizeof(T), cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMalloc( (void**)&vxr1_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n * sizeof(T), cudaMemcpyDeviceToDevice ) );
…
gpuErrchk( cudaMemcpy( vx, vxr1_d, n * sizeof(T), cudaMemcpyDeviceToHost ) );
and the erosion is working correctly now, which was the main problem I was trying to solve. I'm still not getting the stdout output I'm expecting though, so if someone could shed some light on that, please do so.
Upvotes: 1