Pavel

Reputation: 7562

CUDA memory error

I run high-performance calculations on multiple GPUs (two GPUs per machine); currently I'm testing my code on a GeForce GTX TITAN. Recently I noticed that random memory errors occur, so I can't rely on the outcome anymore. I tried to debug this and ran into things I don't understand. I'd appreciate it if someone could help me understand why the following is happening.

So, here's my GPU:

$ nvidia-smi -a
Driver Version                      : 331.67
GPU 0000:03:00.0
    Product Name                    : GeForce GTX TITAN
    ...
    VBIOS Version                   : 80.10.2C.00.02
    FB Memory Usage
        Total                       : 6143 MiB
        Used                        : 14 MiB
        Free                        : 6129 MiB
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A

My Linux machine (Ubuntu 12.04 64-bit):

$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Here's my code (basically: allocate 4 GB of device memory, fill it with zeros, copy it back to the host and check that all values are zero; spoiler: they're not):

#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
        printf("%d: %s\n", e, cudaGetErrorString(e)); \
        return 1; }}

int main() {
    size_t num  = 1024*1024*1024;      // 2^30 (~1 billion) elements
    size_t size = num * sizeof(float); // 4 GB of memory
    float *dp;
    float *p = new float[num];
    cudaError_t e;

    e = cudaMalloc((void**)&dp, size); // allocate
    check(e);

    e = cudaMemset(dp, 0, size);       // set to zero
    check(e);

    e = cudaMemcpy(p, dp, size, cudaMemcpyDeviceToHost); // copy back
    check(e);

    for(size_t i=0; i<num; i++) {
        if (p[i] != 0)                   // this should never happen, amiright?
            printf("%lu %f\n", i, p[i]);
    }
    return 0;
}

Here's how I compile and run it:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1

$ nvcc test.cu 
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.

$ ./a.out | head
516836128 -0.000214
516836164 -0.841684
516836328 -3272.289062
516836428 -644673853950867887966360388719607808.000000
516836692 0.000005
516850472 232680927002624.000000
516850508 909806289566040064.000000
...

$ echo $?
0

This is not what I expected: many elements are non-zero. Here are a couple of observations:

  1. I checked with cuda-memcheck - no errors. Checked with valgrind's memcheck - no errors.
  2. the memory allocation works as expected: nvidia-smi reports 4179 MiB / 6143 MiB used
  3. the same happens if I
    • allocate less memory (e.g. 2 GB)
    • compile with -arch sm_30 or -arch compute_30 (see capabilities)
    • go from SDK version 6.0 back to 5.5
    • go from the GTX Titan to a Tesla K20c (where ECC checking is enabled and all error counters are zero); the behavior is the same, and I was able to reproduce it on five different GPU cards
    • allocate multiple smaller arrays on the device (roughly as in the sketch after this list)
  4. the errors disappear if I test on a GTX 680
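
For reference, this is roughly what the "multiple smaller arrays" variant looks like (the chunk count and per-chunk size here are arbitrary choices for illustration; they just add up to about the same 4 GB, and the exact split doesn't matter):

#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
        printf("%d: %s\n", e, cudaGetErrorString(e)); \
        return 1; }}

int main() {
    const int    chunks = 16;                  // arbitrary split: 16 x 256 MB = 4 GB
    const size_t num    = 64*1024*1024;        // 64M floats per chunk
    const size_t size   = num * sizeof(float); // 256 MB per chunk
    float *dp[chunks];
    float *p = new float[num];
    cudaError_t e;

    for (int c = 0; c < chunks; c++) {         // allocate and zero every chunk
        e = cudaMalloc((void**)&dp[c], size); check(e);
        e = cudaMemset(dp[c], 0, size);       check(e);
    }
    for (int c = 0; c < chunks; c++) {         // copy each chunk back and verify
        e = cudaMemcpy(p, dp[c], size, cudaMemcpyDeviceToHost); check(e);
        for (size_t i = 0; i < num; i++)
            if (p[i] != 0)
                printf("chunk %d, index %lu: %f\n", c, i, p[i]);
        cudaFree(dp[c]);
    }
    delete[] p;
    return 0;
}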

Again, the question is: why do I see those memory errors and how can I ensure that this never happens?

Upvotes: 3

Views: 2189

Answers (1)

diegonher

Reputation: 11

I also perform calculations on the GPU and we have found the same issue. We are using a GeForce GTX 660 Ti.

I have checked that the number of errors increases with the time the GPU has been working. The problem can be solved by shutting down the computer (rebooting alone doesn't help), but after the machine has been working for some time the problem starts again. I have no idea why that happens. I have tried several codes to check the memory and all of them give the same result.

As far as I have checked, this problem cannot be avoided, and the only way to be sure that your results are OK is to check the memory after the calculations and to shut down the machine every so often. I know this is not a good solution, but it is the only one I have found.
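
A minimal sketch of that kind of memory check (the buffer size and byte pattern below are arbitrary choices, not our exact code) looks something like this:

#include <cstdio>

// Fill a device buffer with a known byte pattern, copy it back and count
// how many bytes came back corrupted.
int main() {
    const size_t size = 512UL*1024*1024;       // 512 MB test buffer (arbitrary size)
    unsigned char *d;
    unsigned char *h = new unsigned char[size];

    if (cudaMalloc((void**)&d, size) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0xA5, size);                 // write the test pattern
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < size; i++)
        if (h[i] != 0xA5)
            bad++;
    printf("%lu corrupted bytes out of %lu\n", bad, size);

    cudaFree(d);
    delete[] h;
    return 0;
}

If this reports a non-zero number of corrupted bytes, the results of the run that just finished can't be trusted and the machine needs a shutdown before trying again.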

Upvotes: 1
