Reputation: 7562
I run high-performance calculations on multiple GPUs (two GPUs per machine); currently I test my code on a GeForce GTX TITAN. Recently I noticed that random memory errors occur, so I can't rely on the outcome anymore. I tried to debug and ran into things I don't understand. I'd appreciate it if someone could help me understand why the following is happening.
So, here's my GPU:
$ nvidia-smi -a
Driver Version : 331.67
GPU 0000:03:00.0
Product Name : GeForce GTX TITAN
...
VBIOS Version : 80.10.2C.00.02
FB Memory Usage
Total : 6143 MiB
Used : 14 MiB
Free : 6129 MiB
Ecc Mode
Current : N/A
Pending : N/A
My Linux machine (Ubuntu 12.04 64-bit):
$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Here's my code (basically: allocate 4 GB of memory, fill it with zeros, copy it back to the host, and check that all values are zero; spoiler: they're not):
#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
    printf("%d: %s\n", e, cudaGetErrorString(e)); \
    return 1; }}

int main() {
    size_t num = 1024*1024*1024;       // 1 billion elements
    size_t size = num * sizeof(float); // 4 GB of memory
    float *dp;
    float *p = new float[num];
    cudaError_t e;
    e = cudaMalloc((void**)&dp, size);                   // allocate
    check(e);
    e = cudaMemset(dp, 0, size);                         // set to zero
    check(e);
    e = cudaMemcpy(p, dp, size, cudaMemcpyDeviceToHost); // copy back
    check(e);
    for(size_t i=0; i<num; i++) {
        if (p[i] != 0) // this should never happen, amiright?
            printf("%lu %f\n", i, p[i]);
    }
    return 0;
}
I compile and run it like this:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$ nvcc test.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
$ ./a.out | head
516836128 -0.000214
516836164 -0.841684
516836328 -3272.289062
516836428 -644673853950867887966360388719607808.000000
516836692 0.000005
516850472 232680927002624.000000
516850508 909806289566040064.000000
...
$ echo $?
0
This is not what I expected: many elements are non-zero. Here are a couple of observations:
- cuda-memcheck - no errors.
- Checked with valgrind's memcheck - no errors.
- nvidia-smi reports 4179MiB / 6143MiB.
- -arch sm_30 or -arch compute_30 (see capabilities); an example compile line is sketched below.

Again, the question is: why do I see those memory errors and how can I ensure that this never happens?
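For reference, a compile line with an explicit architecture would look something like this (sm_30 is just the flag value mentioned above; the GTX TITAN itself is compute capability 3.5, so -arch sm_35 is another plausible choice):
$ nvcc -arch sm_30 test.cu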
Upvotes: 3
Views: 2189
Reputation: 11
I also perform calculations on the GPU and we have found the same issue. We are using a GeForce GTX 660 Ti.
I have checked that the number of errors increases with the time the GPU has been working. The problem can be solved by shutting down the computer (rebooting alone does not help), but after some time of work the problem starts again. I have no idea why that happens. I have tried several codes to check the memory and all of them give the same result.
As far as I have checked, this problem cannot be avoided, and the only way to be sure that your results are OK is to check the memory after the calculations and to shut down the machine every so often. I know this is not a good solution, but it is the only one I have found.
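For illustration, here is a minimal sketch of the kind of memory check I mean: write a known pattern into a large device buffer, copy it back to the host, and count mismatches. The buffer size and the pattern value are arbitrary choices for this sketch, not the exact code we run.

#include <cstdio>

// Abort from main with a message if a CUDA call fails.
#define CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        printf("CUDA error %d: %s\n", err_, cudaGetErrorString(err_)); \
        return 1; \
    } \
} while (0)

int main() {
    // Size and pattern are arbitrary choices for this sketch.
    size_t num = size_t(1) << 28;              // 2^28 words, 1 GiB of unsigned int
    size_t bytes = num * sizeof(unsigned int);
    unsigned int pattern = 0xA5A5A5A5u;

    unsigned int *dbuf;
    unsigned int *hbuf = new unsigned int[num];

    CHECK(cudaMalloc((void**)&dbuf, bytes));
    CHECK(cudaMemset(dbuf, 0xA5, bytes));      // memset is byte-wise: every word becomes 0xA5A5A5A5
    CHECK(cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost));

    size_t mismatches = 0;
    for (size_t i = 0; i < num; i++)
        if (hbuf[i] != pattern)
            mismatches++;

    printf("%lu mismatches out of %lu words\n", mismatches, num);

    CHECK(cudaFree(dbuf));
    delete[] hbuf;
    return mismatches ? 1 : 0;
}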
Upvotes: 1