Slampisko
Slampisko

Reputation: 99

CUDA -- simple code but some of my warps don't run

EDIT: As I was reading this question after myself, I figured it out.

The root of the problem is most likely that I didn't allocate enough memory. I will try to think about this and do it correctly and then answer to my question. Silly me. :-[ It doesn't explain the warps not showing up in stdout though...

Original question

I created a templated kernel in CUDA in which I iterate over sections of grayscale image data in global memory (shared memory optimizations are due when I get this working) to achieve morphological operations with disc-shaped structure elements. Each thread corresponds to a pixel of the image. When the data type is char, everything works as expected, all my threads do what they should. When I change it to unsigned short, it starts acting up and only computes the upper half of my image. When I put in some printfs (my device has 2.0 CC), I found out that some of the warps that should run aren't even computed.

Here's the relevant code.

From my main.cpp I call gcuda::ErodeGpuGray8(img, radius); and gcuda::ErodeGpuGray16(img, radius); which are the following functions:

// gcuda.h
…
i3d::Image3d<i3d::GRAY8> ErodeGpuGray8(i3d::Image3d<i3d::GRAY8> img, const unsigned int radius);
i3d::Image3d<i3d::GRAY16> ErodeGpuGray16(i3d::Image3d<i3d::GRAY16> img, const unsigned int radius);
…

// gcuda.cu
…
// call this from outside
Image3d<GRAY8> ErodeGpuGray8(Image3d<GRAY8> img, const unsigned int radius) {
    return ErodeGpu<GRAY8>(img, radius);
}

// call this from outside
Image3d<GRAY16> ErodeGpuGray16(Image3d<GRAY16> img, const unsigned int radius) {
    return ErodeGpu<GRAY16>(img, radius);
}
…

The library I'm using defines GRAY8 as char and GRAY16 as unsigned short.

Here's how I call the kernel (blockSize is a const int set to 128 in the relevant namespace):

// gcuda.cu
template<typename T> Image3d<T> ErodeGpu(Image3d<T> img, const unsigned int radius) {
    unsigned int width = img.GetWidth();
    unsigned int height = img.GetHeight();
    unsigned int w = nextHighestPower2(width);
    unsigned int h = nextHighestPower2(height);
    const size_t n = width * height;
    const size_t N = w * h;

    Image3d<T>* rslt = new Image3d<T>(img);
    T *vx = rslt->GetFirstVoxelAddr();

    // kernel parameters
    dim3 dimBlock( blockSize );
    dim3 dimGrid( ceil( N / (float)blockSize) );

    // source voxel array on device (orig)
    T *vx_d;

    // result voxel array on device (for result of erosion)
    T *vxr1_d;

    // allocate memory on device
    gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
    gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );

    gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
    gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );

    ErodeGpu<T><<<dimGrid, dimBlock>>>(vx_d, vxr1_d, n, width, radius);

    gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );

    // free device memory
    gpuErrchk( cudaFree( vx_d ) );
    gpuErrchk( cudaFree( vxr1_d ) );

    // for debug purposes
    rslt->SaveImage("../erodegpu.png");
    return rslt;
}

The dimensions of my testing image are 82x82, so n = 82*82 = 6724 and N = 128*128 = 16384.

This is my kernel:

// gcuda.cu
// CUDA Kernel -- used for image erosion with a circular structure element of radius "erosionR"
template<typename T> __global__ void ErodeGpu(const T *in, T *out, const unsigned int n, const int width, const int erosionR)
{
    ErodeOrDilateCore<T>(ERODE, in, out, n, width, erosionR);
}

// The core of erosion or dilation. Operation is determined by the first parameter
template<typename T> __device__ void ErodeOrDilateCore(operation_t operation, const T *in, T *out, const unsigned int n, const int width, const int radius) {
    // get thread number, this method is overkill for my purposes but generally should be bulletproof, right?
    int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
    int tx = threadId;

    if (tx >= n) {
        printf("[%d > %d]", tx, n);
        return;
    } else {
        printf("{%d}", tx);
    }

    … (erosion implementation, stdout is the same when this is commented out so it's probably not the root of the problem)
}

To my understanding, this code should write a randomly sorted set of [X > N] and {X} strings to stdout, where X = thread ID and there should be n curly-bracketed numbers (i.e. the output of threads with the index < n) and N - n of the rest, but when I run it and count the curly-bracketed numbers using a regex, I find out that I only get 256 of them. Furthermore, they seem to occur in 32-member groups, which tells me that some warps are run and some are not.

I am really baffled by this. It doesn't help that when I don't comment out the erosion implementation part, the GRAY8 erosion works and the GRAY16 erosion doesn't, even though the stdout output is exactly the same in both cases (could be input-dependent, I only tried this with 2 images).

What am I missing? What could be the cause of this? Is there some memory-management mistake on my part or is it fine that some warps don't run and the erosion stuff is possibly just a bug in the image library that only occurs with the GRAY16 type?

Upvotes: 0

Views: 166

Answers (1)

Slampisko
Slampisko

Reputation: 99

So this was just a stupid malloc mistake.

Instead of

const size_t n = width * height;
const size_t N = w * h;

I used

const int n = width * height;
const int N = w * h;

and instead of the erroneous

gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );

gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );

…

gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );

I used

gpuErrchk( cudaMalloc( (void**)&vx_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n * sizeof(T), cudaMemcpyHostToDevice ) );

gpuErrchk( cudaMalloc( (void**)&vxr1_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n * sizeof(T), cudaMemcpyDeviceToDevice ) );

…

gpuErrchk( cudaMemcpy( vx, vxr1_d, n * sizeof(T), cudaMemcpyDeviceToHost ) );

and the erosion is working correctly now, which was the main problem I was trying to solve. I'm still not getting the stdout output I'm expecting though, so if someone could shed some light on that, please do so.

Upvotes: 1

Related Questions