Reputation: 129
I'm trying to use two kinds of timers to measure the run time of a GPU kernel. As shown in the code below, I have cudaEventRecord measuring the overall kernel, and inside the kernel I have clock() calls. However, the output of the code shows that the two timers give different measurements:
gpu freq = 1530000 khz
Hello from block 0, thread 0
kernel runtime: 0.0002453 seconds
kernel cycle: 68194
According to the results, the kernel took 68194 clock cycles, so the corresponding time should be 68194/1530000000 = 0.00004457 seconds. But the cudaEvent timer showed 0.0002453 seconds. Could anyone explain why? Thank you.
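For clarity, the conversion I'm doing by hand is just this (clockRate from cudaGetDeviceProperties is reported in kHz):

#include <stdio.h>

int main(void)
{
    // clockRate is in kHz, so multiply by 1000 to get Hz
    double implied_seconds = 68194.0 / (1530000.0 * 1000.0);
    printf("implied kernel time: %.10f seconds\n", implied_seconds);  // ~0.0000446 s, not 0.0002453
    return 0;
}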
============================
#include <iostream>
#include <stdio.h>
#include <math.h>

__global__ void add(int *runtime)
{
    clock_t start_time = clock();
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    clock_t end_time = clock();
    *runtime = (int)(end_time - start_time);
}

int main(void)
{
    int *runtime;
    cudaDeviceProp prop;
    int result = cudaGetDeviceProperties(&prop, 0);
    printf("gpu freq = %d khz\n", prop.clockRate);

    cudaMallocManaged(&runtime, sizeof(int));
    *runtime = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    add<<<1, 1>>>(runtime);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaStreamSynchronize(0);

    float milliseconds = 0.f;
    cudaEventElapsedTime(&milliseconds, start, stop);
    float seconds = milliseconds / 1000.f;
    printf("kernel runtime: %.7f seconds\n", seconds);

    cudaDeviceSynchronize();
    printf("kernel cycle: %d\n", *runtime);

    cudaFree(runtime);
    return 0;
}
Upvotes: 0
Views: 267
Reputation: 151799
I wouldn't use managed memory for this kind of work, if I could avoid it. It introduces a lot of complexity (unless you like that sort of thing).
To understand managed memory performance, it's important to know which GPU you are running on, which CUDA version, and also which OS (here, CentOS).
I'm running on a Tesla V100 on CentOS with CUDA 10.1.243 and I see large variability (on the order of 3x to 10x) run-to-run. I attribute this to the demand-paging of memory that is going on.
Let's take a look at my SASS code:
$ nvcc -arch=sm_70 -o t1627 t1627.cu
$ cuobjdump -sass ./t1627
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_70
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_70
Function : _Z3addPi
.headerflags @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /* 0x00000a00ff017624 */
/* 0x000fd000078e00ff */
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ; /* 0x000000fffffff389 */
/* 0x000fe200000e00ff */
/*0020*/ IADD3 R1, R1, -0x8, RZ ; /* 0xfffffff801017810 */
/* 0x000fc80007ffe0ff */
/*0030*/ IADD3 R6, P0, R1, c[0x0][0x20], RZ ; /* 0x0000080001067a10 */
/* 0x000fca0007f1e0ff */
/*0040*/ IMAD.X R7, RZ, RZ, c[0x0][0x24], P0 ; /* 0x00000900ff077624 */
/* 0x000fd000000e06ff */
/*0050*/ CS2R.32 R2, SR_CLOCKLO ; //begin in-kernel timing
/* 0x000fd00000005000 */
/*0060*/ S2R R9, SR_TID.X ; /* 0x0000000000097919 */
/* 0x000e220000002100 */
/*0070*/ MOV R4, 0x0 ; /* 0x0000000000047802 */
/* 0x000fe40000000f00 */
/*0080*/ MOV R5, 0x0 ; /* 0x0000000000057802 */
/* 0x000fe20000000f00 */
/*0090*/ S2R R8, SR_CTAID.X ; /* 0x0000000000087919 */
/* 0x000e280000002500 */
/*00a0*/ STL.64 [R1], R8 ; /* 0x0000000801007387 */
/* 0x0011e60000100a00 */
/*00b0*/ MOV R20, 0x0 ; /* 0x0000000000147802 */
/* 0x000fe40000000f00 */
/*00c0*/ MOV R21, 0x0 ; /* 0x0000000000157802 */
/* 0x000fd00000000f00 */
/*00d0*/ CALL.ABS.NOINC 0x0 ; //printf call
/* 0x001fea0003c00000 */
/*00e0*/ CS2R.32 R5, SR_CLOCKLO ; //end in-kernel timing
/* 0x000fd00000005000 */
/*00f0*/ IMAD.IADD R5, R5, 0x1, -R2 ; /* 0x0000000105057824 */
/* 0x000fe400078e0a02 */
/*0100*/ IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160] ; // set up managed address
/* 0x000fc400078e00ff */
/*0110*/ IMAD.MOV.U32 R3, RZ, RZ, c[0x0][0x164] ; /* 0x00005900ff037624 */
/* 0x000fd000078e00ff */
/*0120*/ STG.E.SYS [R2], R5 ; // first (only) touch on managed allocation
/* 0x000fe2000010e900 */
/*0130*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0140*/ BRA 0x140; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
...................
Fatbin ptx code:
================
arch = sm_70
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$
I've added some comments above. The in-kernel timing region (where you read clock() in the source code) is delineated by lines 0050 and 00e0. After line 00e0 (so, after you have finished the in-kernel timing) you touch the managed allocation runtime to store the result, on line 0120.
In my case, I have a Tesla V100 with CUDA 10.1.243 on CentOS 7. This is a demand-paged regime for unified memory: the first touch of a managed allocation triggers a page fault, and that page fault is serviced by a complex interaction between the host operating system and the CUDA runtime (effectively the device operating system). This page-fault servicing takes place outside of your in-kernel timing measurement, but it is captured by any kernel-level timing (i.e. it lengthens the kernel duration), such as CUDA event based timing or profilers.
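If you want to keep managed memory, one possible way to take the page-fault servicing out of the timed region is to prefetch the allocation to the device before the timed launch. This is just a sketch, not something I've benchmarked here, and it assumes a Pascal-or-newer GPU on an OS where demand paging and cudaMemPrefetchAsync are supported (they are on this linux setup):

#include <stdio.h>

__global__ void add(int *runtime)
{
    clock_t start_time = clock();
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    clock_t end_time = clock();
    *runtime = (int)(end_time - start_time);
}

int main(void)
{
    int *runtime;
    cudaMallocManaged(&runtime, sizeof(int));
    *runtime = 0;
    // migrate the managed pages to device 0 up front, so the kernel's first
    // touch does not trigger a demand page fault inside the timed region
    cudaMemPrefetchAsync(runtime, sizeof(int), 0, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    add<<<1, 1>>>(runtime);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0.f;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("kernel runtime: %.7f seconds\n", milliseconds / 1000.f);

    cudaDeviceSynchronize();
    printf("kernel cycle: %d\n", *runtime);
    cudaFree(runtime);
    return 0;
}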
If I modify your code to use an ordinary device allocation, the large runtime variability goes away. If I make some additional changes for what I consider to be good benchmarking practice (such as performing a warm-up run), I find that the numbers correspond to each other somewhat better:
$ cat t1627.cu
#include <iostream>
#include <stdio.h>
#include <math.h>

__global__ void add(int *runtime)
{
    clock_t start_time = clock();
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    clock_t end_time = clock();
    *runtime = (int)(end_time - start_time);
}

int main(void)
{
    int *runtime;
    cudaDeviceProp prop;
    int result = cudaGetDeviceProperties(&prop, 0);
    printf("gpu freq = %d khz\n", prop.clockRate);

    cudaMalloc(&runtime, sizeof(int));
    cudaMemset(runtime, 0, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    add<<<1, 1>>>(runtime);      // warm-up run
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    add<<<1, 1>>>(runtime);      // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0.f;
    cudaEventElapsedTime(&milliseconds, start, stop);
    float seconds = milliseconds / 1000.f;
    printf("kernel runtime: %f s \n", seconds);

    int h_runtime;
    cudaMemcpy(&h_runtime, runtime, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel cycle: %d\n", h_runtime);

    cudaFree(runtime);
    return 0;
}
$ nvcc -arch=sm_70 -o t1627 t1627.cu
$ ./t1627
gpu freq = 1380000 khz
Hello from block 0, thread 0
Hello from block 0, thread 0
kernel runtime: 0.000059 s
kernel cycle: 57376
$ nvprof ./t1627
==28252== NVPROF is profiling process 28252, command: ./t1627
gpu freq = 1380000 khz
Hello from block 0, thread 0
Hello from block 0, thread 0
kernel runtime: 0.000069 s
kernel cycle: 58997
==28252== Profiling application: ./t1627
==28252== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 96.49% 109.00us 2 54.497us 49.569us 59.426us add(int*)
1.93% 2.1760us 1 2.1760us 2.1760us 2.1760us [CUDA memcpy DtoH]
1.59% 1.7920us 1 1.7920us 1.7920us 1.7920us [CUDA memset]
API calls: 96.20% 329.20ms 1 329.20ms 329.20ms 329.20ms cudaMalloc
1.58% 5.4205ms 4 1.3551ms 695.98us 3.3263ms cuDeviceTotalMem
1.56% 5.3336ms 388 13.746us 357ns 614.73us cuDeviceGetAttribute
0.35% 1.1925ms 1 1.1925ms 1.1925ms 1.1925ms cudaGetDeviceProperties
0.13% 435.16us 4 108.79us 103.50us 114.98us cuDeviceGetName
0.07% 235.87us 1 235.87us 235.87us 235.87us cudaFree
0.03% 114.74us 2 57.371us 17.808us 96.935us cudaLaunchKernel
0.03% 88.291us 1 88.291us 88.291us 88.291us cudaDeviceSynchronize
0.02% 59.720us 1 59.720us 59.720us 59.720us cudaEventSynchronize
0.01% 35.692us 1 35.692us 35.692us 35.692us cudaMemcpy
0.01% 26.655us 4 6.6630us 3.8710us 11.334us cuDeviceGetPCIBusId
0.01% 26.631us 1 26.631us 26.631us 26.631us cudaMemset
0.00% 16.933us 2 8.4660us 5.9710us 10.962us cudaEventRecord
0.00% 8.8200us 8 1.1020us 449ns 1.8970us cuDeviceGet
0.00% 8.5660us 2 4.2830us 1.0320us 7.5340us cudaEventCreate
0.00% 4.0930us 3 1.3640us 390ns 2.3880us cuDeviceGetCount
0.00% 3.6490us 1 3.6490us 3.6490us 3.6490us cudaEventElapsedTime
0.00% 2.9010us 4 725ns 547ns 900ns cuDeviceGetUuid
$
in kernel: 57376/1380000000 = 41.5us
event: 69us
nvprof: 49.57us
Note that the in-kernel measurement calculated above assumes the GPU is running at essentially its maximum clock rate. That isn't always the case, and may not be the typical case. Therefore, the implied time could be longer than what is calculated above (if the effective clock rate is lower than the max). All recent GPUs have variable clocking schemes.
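If you want to remove that assumption, one possible approach (just a sketch, not folded into the example above) is to read the current SM clock via NVML and use that for the conversion instead of prop.clockRate. It assumes NVML is available and you link with -lnvidia-ml:

#include <stdio.h>
#include <nvml.h>   // link with -lnvidia-ml

int main(void)
{
    unsigned int sm_clock_mhz = 0;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;
    // current SM clock, in MHz, at the moment of the query
    if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock_mhz) != NVML_SUCCESS) return 1;
    printf("current SM clock: %u MHz\n", sm_clock_mhz);

    // convert the in-kernel cycle count using the measured clock
    int cycles = 57376;   // the value reported by the kernel above
    printf("implied kernel time: %f us\n", cycles / (sm_clock_mhz * 1.0e6) * 1.0e6);

    nvmlShutdown();
    return 0;
}

This is only approximate, since the clock can change between the query and the kernel run, but it gives a better idea of the effective clock than assuming the max.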
Upvotes: 1