Pavel

Reputation: 7562

CUDA memory error

I run high-performance calculations on multiple GPUs (two GPUs per machine); currently I'm testing my code on a GeForce GTX TITAN. Recently I noticed that random memory errors occur, so I can't rely on the outcome anymore. I tried to debug this and ran into things I don't understand. I'd appreciate it if someone could help me understand why the following is happening.

So, here's my GPU:

$ nvidia-smi -a
Driver Version                      : 331.67
GPU 0000:03:00.0
    Product Name                    : GeForce GTX TITAN
    ...
    VBIOS Version                   : 80.10.2C.00.02
    FB Memory Usage
        Total                       : 6143 MiB
        Used                        : 14 MiB
        Free                        : 6129 MiB
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A

My Linux machine (Ubuntu 12.04 64-bit):

$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Here's my code (basically: allocate 4 GB of device memory, fill it with zeros, copy it back to the host and check that all values are zero; spoiler: they're not):

#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
        printf("%d: %s\n", e, cudaGetErrorString(e)); \
        return 1; }}

int main() {
    size_t num  = 1024*1024*1024;      // 2^30 (~1 billion) elements
    size_t size = num * sizeof(float); // 4 GB of memory
    float *dp;
    float *p = new float[num];
    cudaError_t e;

    e = cudaMalloc((void**)&dp, size); // allocate
    check(e);

    e = cudaMemset(dp, 0, size);       // set to zero
    check(e);

    e = cudaMemcpy(p, dp, size, cudaMemcpyDeviceToHost); // copy back
    check(e);

    for(size_t i=0; i<num; i++) {
        if (p[i] != 0)                   // this should never happen, amiright?
            printf("%lu %f\n", i, p[i]);
    }
    return 0;
}

Here's how I compile and run it:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1

$ nvcc test.cu 
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.

$ ./a.out | head
516836128 -0.000214
516836164 -0.841684
516836328 -3272.289062
516836428 -644673853950867887966360388719607808.000000
516836692 0.000005
516850472 232680927002624.000000
516850508 909806289566040064.000000
...

$ echo $?
0

This is not what I expected: many elements are non-zero. Here are a couple of observations:

  1. I checked with cuda-memcheck - no errors. Checked with valgrind's memcheck - no errors.
  2. the memory allocation works as expected: nvidia-smi reports 4179 MiB / 6143 MiB used
  3. the same happens if I
    • allocate less memory (e.g. 2 GB)
    • compile with -arch sm_30 or -arch compute_30 (see capabilities)
    • go from SDK version 6.0 back to 5.5
    • go from the GTX Titan to a Tesla K20c (where ECC checking is enabled and all error counters are zero); the behavior is the same, and I was able to reproduce it on five different GPU cards
    • allocate multiple smaller arrays on the device (roughly as in the sketch after this list)
  4. the errors disappear if I test on a GTX 680
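
For reference, this is roughly what the "multiple smaller arrays" variant looks like (the chunk count and per-chunk size here are arbitrary choices for illustration; they just add up to about the same 4 GB, and the exact split doesn't matter):

#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
        printf("%d: %s\n", e, cudaGetErrorString(e)); \
        return 1; }}

int main() {
    const int    chunks = 16;                  // arbitrary split: 16 x 256 MB = 4 GB
    const size_t num    = 64*1024*1024;        // 64M floats per chunk
    const size_t size   = num * sizeof(float); // 256 MB per chunk
    float *dp[chunks];
    float *p = new float[num];
    cudaError_t e;

    for (int c = 0; c < chunks; c++) {         // allocate and zero every chunk
        e = cudaMalloc((void**)&dp[c], size); check(e);
        e = cudaMemset(dp[c], 0, size);       check(e);
    }
    for (int c = 0; c < chunks; c++) {         // copy each chunk back and verify
        e = cudaMemcpy(p, dp[c], size, cudaMemcpyDeviceToHost); check(e);
        for (size_t i = 0; i < num; i++)
            if (p[i] != 0)
                printf("chunk %d, index %lu: %f\n", c, i, p[i]);
        cudaFree(dp[c]);
    }
    delete[] p;
    return 0;
}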

Again, the question is: why do I see those memory errors and how can I ensure that this never happens?

Upvotes: 3

Views: 2189

Answers (1)

diegonher

Reputation: 11

I also perform calculations on the GPU and we have found the same issue. We are using a GeForce GTX 660 Ti.

I have checked that the number of errors increases with the time the GPU has been working. The problem can be solved by shutting down the computer (rebooting alone doesn't help), but after the machine has been working for some time the problem starts again. I have no idea why that happens. I have tried several codes to check the memory and all of them give the same result.

As far as I have checked, this problem cannot be avoided, and the only way to be sure that your results are OK is to check the memory after the calculations and to shut down the machine every so often. I know this is not a good solution, but it is the only one I have found.
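
A minimal sketch of that kind of memory check (the buffer size and byte pattern below are arbitrary choices, not our exact code) looks something like this:

#include <cstdio>

// Fill a device buffer with a known byte pattern, copy it back and count
// how many bytes came back corrupted.
int main() {
    const size_t size = 512UL*1024*1024;       // 512 MB test buffer (arbitrary size)
    unsigned char *d;
    unsigned char *h = new unsigned char[size];

    if (cudaMalloc((void**)&d, size) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0xA5, size);                 // write the test pattern
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < size; i++)
        if (h[i] != 0xA5)
            bad++;
    printf("%lu corrupted bytes out of %lu\n", bad, size);

    cudaFree(d);
    delete[] h;
    return 0;
}

If this reports a non-zero number of corrupted bytes, the results of the run that just finished can't be trusted and the machine needs a shutdown before trying again.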

Upvotes: 1
