Udacity parallel programming, unspecified launch failure cudaGetLastError()

Question

I am trying to complete homework #2 for the Udacity parallel programming course. I have run into a CUDA error that I just can't get around. The error is encountered when I launch a kernel that is meant to separate an image in the format "RGBRGBRGB" into three separate arrays: "RRR", "GGG", and "BBB". Since the error "unspecified launch failure" does not give me anything specific to go on, I am not sure how to troubleshoot the issue.

Here is the "main" function called to start the entire process. I left out everything after the point where the error occurs so that I don't post the rest of my work for someone to find later.

void your_gaussian_blur(const uchar4 * const h_inputImageRGBA, uchar4 * const d_inputImageRGBA, uchar4* const d_outputImageRGBA, const size_t numRows, const size_t numCols,
                        unsigned char *d_redBlurred, 
                        unsigned char *d_greenBlurred, 
                        unsigned char *d_blueBlurred,
                        const int filterWidth)
{

    // Maximum number of threads per block = 512; do this
    // to keep this compatible with CUDA 5 and lower
    // MAX > threadsX * threadsY * threadsZ
    int MAXTHREADSx = 16;
    int MAXTHREADSy = 16; // 16 x 16 x 1 = 256
    // We want to fill the blocks so we don't waste this block's threads
    // I wonder if blocks can intermix in a physical core? 
    // Either way this method makes things "clean"; one thread per px
    int nBlockX = numCols / MAXTHREADSx + 1;
    int nBlockY = numRows / MAXTHREADSy + 1;

    const dim3 blockSize(MAXTHREADSx, MAXTHREADSy, 1);
    const dim3 gridSize(nBlockX, nBlockY, 1);

    separateChannels<<<gridSize, blockSize>>>(
        h_inputImageRGBA,
        numRows,
        numCols,
        d_red,
        d_green,
        d_blue);

  // Call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after
  // launching your kernel to make sure that you didn't make any mistakes.
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

And here is the separateChannels kernel:

//This kernel takes in an image represented as a uchar4 and splits
//it into three images consisting of only one color channel each
__global__
void separateChannels(const uchar4* const inputImageRGBA,
                                int numRows,
                                int numCols,
                                unsigned char* const redChannel,
                                unsigned char* const greenChannel,
                                unsigned char* const blueChannel)
{
    //const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y);
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;

    //if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
    //  return;
    if (col >= numCols || row >= numRows)
        return;

    //const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;
    int arrayPos = row * numCols + col;

    uchar4 rgba = inputImageRGBA[arrayPos];
    redChannel[arrayPos] = rgba.x;
    greenChannel[arrayPos] = rgba.y;
    blueChannel[arrayPos] = rgba.z;
}

I think I have included everything necessary; please let me know if not.

Michal Hosala · Accepted Answer

Without seeing the rest of the code I cannot tell for sure, but I believe you are passing a pointer to host memory as a parameter to the CUDA kernel, and a kernel generally cannot dereference ordinary host memory. In the kernel launch you pass h_inputImageRGBA, while I believe you want to pass d_inputImageRGBA.

Typically, the h_ prefix stands for host memory, while d_ stands for device memory.
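As a minimal sketch of the fix suggested above, the following shows the kernel from the question being launched with device pointers only. The helper splitOnDevice and its parameters are hypothetical (they are not part of the course code), and the sketch assumes a non-empty image and a CUDA-capable GPU. It also checks cudaGetLastError() and cudaDeviceSynchronize() separately, so that a bad launch configuration and a fault raised while the kernel runs (such as dereferencing an invalid pointer) are reported as distinct errors.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question: one thread per pixel, with a bounds check.
__global__ void separateChannels(const uchar4* const inputImageRGBA,
                                 int numRows, int numCols,
                                 unsigned char* const redChannel,
                                 unsigned char* const greenChannel,
                                 unsigned char* const blueChannel)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= numCols || row >= numRows)
        return;

    const int arrayPos = row * numCols + col;
    const uchar4 rgba = inputImageRGBA[arrayPos];
    redChannel[arrayPos]   = rgba.x;
    greenChannel[arrayPos] = rgba.y;
    blueChannel[arrayPos]  = rgba.z;
}

// Hypothetical helper (an assumption for illustration, not course code):
// copies the host image to the device, runs the kernel on d_ pointers only,
// and copies the three channels back. Returns the first CUDA error seen.
cudaError_t splitOnDevice(const std::vector<uchar4>& h_image,
                          int numRows, int numCols,
                          std::vector<unsigned char>& h_red,
                          std::vector<unsigned char>& h_green,
                          std::vector<unsigned char>& h_blue)
{
    if (numRows <= 0 || numCols <= 0 ||
        h_image.size() != static_cast<size_t>(numRows) * numCols)
        return cudaErrorInvalidValue;

    const size_t n = h_image.size();
    h_red.resize(n);
    h_green.resize(n);
    h_blue.resize(n);

    uchar4* d_image = nullptr;
    unsigned char* d_red = nullptr;
    unsigned char* d_green = nullptr;
    unsigned char* d_blue = nullptr;

    // Allocate device buffers; each step runs only if the previous one succeeded.
    cudaError_t err = cudaMalloc(&d_image, n * sizeof(uchar4));
    if (err == cudaSuccess) err = cudaMalloc(&d_red, n);
    if (err == cudaSuccess) err = cudaMalloc(&d_green, n);
    if (err == cudaSuccess) err = cudaMalloc(&d_blue, n);

    // Host -> device copy: the kernel must read d_image, never h_image.data().
    if (err == cudaSuccess)
        err = cudaMemcpy(d_image, h_image.data(), n * sizeof(uchar4),
                         cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const dim3 blockSize(16, 16, 1);  // 256 threads per block
        // Ceiling division; the kernel's bounds check handles the partial blocks.
        const dim3 gridSize((numCols + 15) / 16, (numRows + 15) / 16, 1);
        separateChannels<<<gridSize, blockSize>>>(d_image, numRows, numCols,
                                                  d_red, d_green, d_blue);
        err = cudaGetLastError();         // errors from the launch itself
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();    // errors raised while the kernel ran

    // Device -> host copies of the three channels.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_red.data(), d_red, n, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_green.data(), d_green, n, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_blue.data(), d_blue, n, cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer is a no-op, so this is safe on every path.
    cudaFree(d_image);
    cudaFree(d_red);
    cudaFree(d_green);
    cudaFree(d_blue);
    return err;
}
```

If the helper returns something other than cudaSuccess, cudaGetErrorString(err) gives the readable message. In the homework itself the device buffers are presumably already allocated by the surrounding course code, so the relevant part is only that the first kernel argument is the d_ pointer.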
