Reputation: 225
I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure
, or the same message along with User Stack Overflow or Breakpoint Hit
or Invalid __global__ write of size 8
at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N
less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
char4 *pc4Buf_h = NULL;
char4 *pc4Buf_d = NULL;
float2 *pf2InX_d = NULL;
float2 *pf2InY_d = NULL;
dim3 dimBCopy(1, 1, 1);
dim3 dimGCopy(1, 1);
...
/* i do check for errors in the actual code */
pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
(void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
(void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
(void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
...
dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
dimGCopy.x = N / 1024;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
pf2InX_d,
pf2InY_d);
...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
float2 *pf2FFTInX,
float2 *pf2FFTInY)
{
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
pf2FFTInY[i].x = (float) pc4Data[i].z;
pf2FFTInY[i].y = (float) pc4Data[i].w;
return;
}
One other thing I noticed in my program is that if I comment out any two char-to-float assignment statements in my kernel, there's no memory error. One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX
), and another from the second two (pf2FFTInY
), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 5 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32
compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads()
between the pf2FFTInX
and the pf2FFTInY
assignment statements, and memory errors disappeared for N
= 32768. But at N
= 65536, I still get errors.<--
This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
Upvotes: 3
Views: 1087
Reputation: 27829
The problem was a bad GPU card (see the comments). [I'm Adding this answer to remove the question from the unanswered list and make it more useful.]
Upvotes: 2