Icarus

Reputation: 225

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.

Just to give you some background, I am reading single-byte samples from a file, packed in the following manner: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array and use the kernel given below to copy them to two float2 arrays, one per signal. (This is just an isolated part of a larger program.)
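(To make the layout concrete, here is a simplified sketch of the read, not my exact code; the file name is just a placeholder. Each 4-byte group in the file becomes one char4 element.)

/* illustrative only -- each char4 holds {Re(S1), Im(S1), Re(S2), Im(S2)} */
FILE *pFile = fopen("samples.dat", "rb");       /* placeholder file name */
(void) fread(pc4Buf_h, sizeof(char4), N, pFile);
(void) fclose(pFile);
/* e.g. the bytes {10, -3, 127, -128} become
   Signal 1 -> (10.0f, -3.0f) and Signal 2 -> (127.0f, -128.0f) */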

When I run the program using cuda-memcheck, I get either an unspecified launch failure on its own, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.

The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.

#define N   32768

/* kernel prototype (definition follows main()) */
__global__ void CopyDataForFFT(char4 *pc4Data, float2 *pf2FFTInX, float2 *pf2FFTInY);

int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* I do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024;  /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
                                           pf2InX_d,
                                           pf2InY_d);
    ...
}

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;    /* one thread per char4 sample */

    /* de-interleave: bytes x,y -> signal 1 (X), bytes z,w -> signal 2 (Y) */
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;

    return;
}
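
(For reference, the error checking mentioned in the comment above is the standard pattern; something along these lines, though not copied verbatim from my code:)

/* typical error-checking pattern (illustrative; my actual macro may differ) */
#define CUDA_SAFE_CALL(x) do {                              \
    cudaError_t iRet = (x);                                 \
    if (iRet != cudaSuccess)                                \
    {                                                       \
        (void) fprintf(stderr, "ERROR: %s\n",               \
                       cudaGetErrorString(iRet));           \
        exit(EXIT_FAILURE);                                 \
    }                                                       \
} while (0)

/* inside main(), after the launch */
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d, pf2InX_d, pf2InY_d);
CUDA_SAFE_CALL(cudaGetLastError());         /* catches launch errors    */
CUDA_SAFE_CALL(cudaDeviceSynchronize());    /* catches execution errors */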

One other thing I noticed in my program is that if I comment out either the first two (pf2FFTInX) or the last two (pf2FFTInY) char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX) and another from the last two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.

I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc., but the program behaviour is the same.

I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.

Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.

UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements. At first the memory errors disappeared for N = 32768 (though not for N = 65536), but that didn't last long; I'm still getting errors.
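
(Concretely, the kernel body for that experiment looked like this:)

    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    __syncthreads();    /* inserted between the two signals' writes */
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;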

UPDATE: Continuing with the weird behaviour: when I run the program using cuda-memcheck, I get 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.

Upvotes: 3

Views: 1087

Answers (1)

harrism

Reputation: 27829

The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and to make it more useful.]

Upvotes: 2
