Reputation: 1021
I am puzzled by the behaviour of the following snippet:
#include <stdio.h>

// kernel
__global__ void CheckAddressing(float * d_Result, int numCols, int numRows)
{
    printf("%d\n", threadIdx.x);
    if (threadIdx.x < 16)
    {
        d_Result[threadIdx.x] = float(364.66);
    }
}
////////
int main(int argc, char ** argv)
{
    int TotalSize = 16;
    float * d_Result;
    float * h_Result;

    cudaSetDevice(0);
    h_Result = (float *)malloc(TotalSize*sizeof(float));
    cudaMalloc((void **) &d_Result, TotalSize*sizeof(float));

    CheckAddressing<<<dim3(1),dim3(16)>>>(d_Result, 8, 8);

    cudaMemcpy(h_Result, d_Result, TotalSize*sizeof(float), cudaMemcpyDeviceToHost);
    for(int n=0; n<16; n++)
    {
        printf("%f\t", h_Result[n]);
    }
    printf("\n");

    // free GPU memory
    cudaFree(d_Result);
    free(h_Result);
    return 0;
}
It works on one machine (I compile with nvcc -arch=sm_30) and returns 364.66 (16 times). However, on another machine running CUDA 5.5 it returns all zeros. Any idea what could be happening?
UPDATE:
cuda-memcheck ./test
========= CUDA-MEMCHECK
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
========= ERROR SUMMARY: 0 errors
nvidia-smi
Fri Apr 18 14:45:05 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.44 Driver Version: 331.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20Xm Off | 0000:02:00.0 Off | 0 |
| N/A 20C P0 50W / 235W | 11MiB / 5759MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
Upvotes: 0
Views: 1527
Reputation: 151799
Dirac mentions Fermi GPUs on its banner. If you are on a node with Fermi GPUs, your compile command is incorrect: -arch=sm_30 is used for Kepler GPUs. Try -arch=sm_20 instead.
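For example (assuming the source file is named test.cu, which is just a guess based on the ./test executable shown above), the compile line for a Fermi node would look something like:
nvcc -arch=sm_20 -o test test.cu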
I was confused by the fact that cuda-memcheck was reporting no errors, but the type of error you are encountering is one that cuda-memcheck will not necessarily catch. Specifically, there is a category of kernel launch failures that can only be trapped by the proper CUDA error checking that @talonmies suggested. Note in particular the error checking that is required immediately after a kernel launch.
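As a rough sketch of that pattern (the macro name and formatting here are mine, not the exact code from that answer), the idea is to check every API call and to check the launch itself right away:

#include <cstdio>
#include <cstdlib>

// illustrative error-checking macro; prints the error and aborts
#define checkCuda(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// checking the launch from the snippet above:
CheckAddressing<<<dim3(1), dim3(16)>>>(d_Result, 8, 8);
checkCuda(cudaPeekAtLastError());    // traps launch failures, e.g. no kernel image for this GPU
checkCuda(cudaDeviceSynchronize());  // traps errors raised while the kernel runs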
When you compile for -arch=sm_30 and try to run it on a Fermi (sm_20) machine, the kernel launch will immediately fail, but all other subsequent CUDA API calls will report no failure.
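To illustrate (a sketch built on your snippet, not code from your post), adding these checks around the existing launch and copy makes the failure visible on the Fermi node: the check right after the launch reports an error (typically "invalid device function"), while the subsequent cudaMemcpy still returns cudaSuccess, which matches the all-zero output you are seeing.

CheckAddressing<<<dim3(1), dim3(16)>>>(d_Result, 8, 8);
cudaError_t launchErr = cudaGetLastError();
printf("launch: %s\n", cudaGetErrorString(launchErr));   // non-success on the sm_20 machine
cudaError_t cpyErr = cudaMemcpy(h_Result, d_Result, TotalSize*sizeof(float),
                                cudaMemcpyDeviceToHost);
printf("memcpy: %s\n", cudaGetErrorString(cpyErr));      // typically still reports "no error"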
The detail page for Dirac does mention a couple of Kepler nodes/GPUs:
• 1 node: Tesla K20Xm
• 1 node: Tesla K40c
I believe your code compiled with -arch=sm_35 should run correctly on those nodes.
And I also note that there are even some older ("Tesla" family) GPUs/nodes:
• 4 nodes: 1 C1060 NVIDIA Tesla GPU with 4GB of memory and 240 parallel CUDA processor cores.
• 1 node: 4 C1060 NVIDIA Tesla GPUs, each with 4GB of memory and 240 parallel CUDA processor cores.
For those nodes, you would need to compile with -arch=sm_13. But don't forget to use proper CUDA error checking any time you are having difficulty with CUDA code.
Or you could use nvcc extended notation to compile and build a single binary/executable that covers all 3 types.
Using extended notation, for the 3 different GPU architectures on that cluster (that I can see):
nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 ...
Upvotes: 1