Reputation: 401
I am using CUDA 4.1 with CUPTI on a Tesla C2070.
The code has two threads. The first thread launches a long-running kernel and waits on cudaDeviceSynchronize(); the second thread then launches a small kernel.
I have subscribed to the CUPTI_RUNTIME_TRACE_CBID_cudaConfigureCall_v3020 and CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020 callbacks.
This blocks the launch of the second kernel: the second thread does not return from cudaConfigureCall() until the first thread's cudaDeviceSynchronize() completes.
If I do not subscribe with CUPTI, this does not happen. This looks like a nasty performance bug in CUPTI.
The gdb call stacks below show the status of each thread. The full code is attached at the end of this post.
(gdb) info threads
4 Thread 0x7f731467c710 (LWP 29708) 0x00000037f4ada083 in select () from /lib64/libc.so.6
3 Thread 0x7f7312b50710 (LWP 29709) 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
2 Thread 0x7f731214f710 (LWP 29710) 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
* 1 Thread 0x7f731477e720 (LWP 29707) 0x00000037f520803d in pthread_join () from /lib64/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f731214f710 (LWP 29710))]#0 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0 0x00000037f4ac88d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007f73149fb73c in ?? () from /usr/local/cuda/extras/CUPTI/lib64/libcupti.so.4
#2 0x00007f7314dabac3 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7314db1020 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007f73147bbee8 in cudaConfigureCall () from /usr/local/cuda/lib64/libcudart.so.4
#5 0x000000000040110f in Thread2 () at event_sampling.cu:121
#6 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f7312b50710 (LWP 29709))]#0 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
(gdb) bt
#0 0x00007f7314d7e3a6 in ?? () from /usr/lib64/libcuda.so.1
#1 0x00007f7314d36b5a in ?? () from /usr/lib64/libcuda.so.1
#2 0x00007f7314d08976 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7314d396a3 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007f7314d39a06 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007f7314d08a29 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007f7314cfb830 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007f7314cdafa4 in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007f731478ea13 in ?? () from /usr/local/cuda/lib64/libcudart.so.4
#9 0x00007f73147c3827 in cudaDeviceSynchronize () from /usr/local/cuda/lib64/libcudart.so.4
#10 0x0000000000400fe2 in Thread1 (ip=0x0) at event_sampling.cu:101
#11 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f731467c710 (LWP 29708))]#0 0x00000037f4ada083 in select () from /lib64/libc.so.6
(gdb) bt
#0 0x00000037f4ada083 in select () from /lib64/libc.so.6
#1 0x00007f731524147b in ?? () from /usr/lib64/libcuda.so.1
#2 0x00007f7314d45d9b in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f7315242819 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00000037f52077e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000037f4ae152d in clone () from /lib64/libc.so.6
(gdb)
CODE
/*
* Copyright 2011 NVIDIA Corporation. All rights reserved
*
* Sample app to demonstrate use of CUPTI library to obtain profiler
* event values by sampling.
*/
#include <stdio.h>
#include <stdlib.h>   // exit()
#include <string.h>   // memset()
#include <cuda.h>
#include <cupti.h>
#include <unistd.h>
#include <pthread.h>
#define CHECK_CU_ERROR(err, cufunc)                             \
  if (err != CUDA_SUCCESS)                                      \
  {                                                             \
    printf("Error %d for CUDA Driver API function '%s'.\n",     \
           err, cufunc);                                        \
    exit(-1);                                                   \
  }
#define N 100000
static CUcontext context;
static CUdevice device;
static char *eventName;
// Device code: artificially long-running vector add. The outer loops just
// repeat the work to keep the kernel on the device for a while.
__global__ void VecAdd(const int* A, const int* B, int* C, int size)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  for (long long m = 0; m < 100; m++)
    for (long long n = 0; n < 100000; n++)
      if (i < size)
        C[i] = A[i] + B[i];
}
static void
initVec(int *vec, int n)
{
  for (int i = 0; i < n; i++)
    vec[i] = i;
}
// Device code: the "small" kernel launched by the second thread
__global__ void VecSub(const int* A, const int* B, int* C, int size)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  for (long long n = 0; n < 100000; n++)
    if (i < size)
      C[i] = A[i] - B[i];
}
// Shared between the two threads
int *d_A; int *d_B; int *d_C;
cudaStream_t stream[2];
pthread_t threads[2];
static void *
Thread1(void *ip)
{
  fprintf(stderr, "\n Thread1 started");
  size_t size = N * sizeof(int);
  int threadsPerBlock = 0;
  int blocksPerGrid = 0;
  int *h_A, *h_B, *h_C;

  // Allocate input vectors h_A and h_B in host memory
  h_A = (int*)malloc(size);
  h_B = (int*)malloc(size);
  h_C = (int*)malloc(size);

  // Initialize input vectors
  initVec(h_A, N);
  initVec(h_B, N);
  memset(h_C, 0, size);

  // Allocate vectors in device memory
  cudaMalloc((void**)&d_A, size);
  cudaMalloc((void**)&d_B, size);
  cudaMalloc((void**)&d_C, size);

  // Copy vectors from host memory to device memory
  cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream[0]);
  cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream[0]);

  threadsPerBlock = 256;
  blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

  fprintf(stderr, "\n Kernel Launch Thread1"); fflush(stderr);
  VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream[0]>>>(d_A, d_B, d_C, N);
  fprintf(stderr, "\n Kernel Launched Thread1"); fflush(stderr);

  fprintf(stderr, "\n Start cudaDeviceSynchronize Thread1"); fflush(stderr);
  cudaDeviceSynchronize();
  fprintf(stderr, "\n End cudaDeviceSynchronize Thread1"); fflush(stderr);
  return 0;
}
static void *Thread2(void *)
{
  // Give Thread1 a head start so its kernel is running on the device
  sleep(5);
  fprintf(stderr, "\n Thread2 started");
  int threadsPerBlock = 256;
  int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

  fprintf(stderr, "\n Kernel Launch Thread2"); fflush(stderr);
  VecSub<<<blocksPerGrid, threadsPerBlock, 0, stream[1]>>>(d_A, d_B, d_C, N);
  fprintf(stderr, "\n Kernel Launched Thread2"); fflush(stderr);

  fprintf(stderr, "\n Start cudaDeviceSynchronize Thread2"); fflush(stderr);
  cudaDeviceSynchronize();
  fprintf(stderr, "\n End cudaDeviceSynchronize Thread2"); fflush(stderr);
  return 0;
}
// CUPTI callback: report the stream id of every cudaConfigureCall
void CUPTIAPI CallBack(void *userdata, CUpti_CallbackDomain domain,
                       CUpti_CallbackId cbid, const void *cbData)
{
  uint32_t streamId = 0;
  const CUpti_CallbackData *cbInfo = (const CUpti_CallbackData *)cbData;
  if (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaConfigureCall_v3020 &&
      cbInfo->callbackSite == CUPTI_API_ENTER) {
    fprintf(stderr, "\n Event created");
    cudaConfigureCall_v3020_params *params =
        (cudaConfigureCall_v3020_params *)cbInfo->functionParams;
    cuptiGetStreamId(cbInfo->context, (CUstream)params->stream, &streamId);
    printf("\n stream %u", streamId);
  }
}
int
main(int argc, char *argv[])
{
  cudaStreamCreate(&stream[0]);
  cudaStreamCreate(&stream[1]);

#if 1
  // Subscribing to these two callbacks is what triggers the blocking
  CUpti_SubscriberHandle subscriber;
  cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)CallBack, 0);
  cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                      CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);
  cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                      CUPTI_RUNTIME_TRACE_CBID_cudaConfigureCall_v3020);
#endif

  cudaDeviceSynchronize();
  pthread_create(&threads[0], 0, Thread1, 0);
  pthread_create(&threads[1], 0, Thread2, 0);
  pthread_join(threads[0], 0);
  pthread_join(threads[1], 0);
  fprintf(stderr, "\n --------------over -----------");
  return 0;
}
Upvotes: 1
Views: 1255
Reputation: 121
CUPTI has two general modes: activity collection and event collection.
With event collection, all kernel launches are serialized across the entire application. This is done because limitations in the hardware performance counters mean that, to get accurate measurements for a kernel, only that single kernel can be executing on the device.
With activity collection, CUPTI tries to perturb the application's behavior as little as possible. The goal is to observe the behavior of the GPU as accurately as possible.
You are correct that there is a bug/limitation in CUPTI that causes cudaDeviceSynchronize() (and other sync functions) to block CUDA calls on other threads. This is a known issue during activity collection (as it obviously breaks the primary goal of low-impact observation) and should be resolved in a future release.
Also, not from your question but mentioned in one of the answers, is the issue of concurrent kernel execution (that is, having two or more kernels executing simultaneously on the device). CUPTI disables concurrent kernel execution in all modes. This too is a known issue and will be resolved in an upcoming release. A sketch of how activity collection is turned on follows.
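In case it helps, here is roughly how activity collection is enabled with the CUPTI 4.x activity API. The CHECK_CUPTI helper and the choice of activity kinds are illustrative assumptions on my part, not anything from the question's code, and I have left out the buffer handling (cuptiActivityEnqueueBuffer and friends) for brevity:
#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical error-check helper, for illustration only
#define CHECK_CUPTI(call)                                  \
  do {                                                     \
    CUptiResult _status = (call);                          \
    if (_status != CUPTI_SUCCESS) {                        \
      fprintf(stderr, "CUPTI error %d at %s:%d\n",         \
              (int)_status, __FILE__, __LINE__);           \
      exit(-1);                                            \
    }                                                      \
  } while (0)

static void enableActivityCollection(void)
{
  // Record kernel and memcpy activity; other kinds are enabled the same way
  CHECK_CUPTI(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL));
  CHECK_CUPTI(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY));
}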
Upvotes: 0
Reputation: 27899
This is probably caused by the use of cudaDeviceSynchronize() in both threads. cudaDeviceSynchronize() forces the whole device to finish all previously issued commands before any subsequent commands proceed. It is a heavy hammer; use it sparingly.
I suggest cudaStreamSynchronize() in this case instead. If you need one stream to wait on another, use a cudaEvent and cudaStreamWaitEvent().
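For example, reusing the kernel and stream names from your code (a minimal sketch; whether VecSub actually needs to wait for VecAdd is an assumption here, so adjust it to your real dependency):
// Each thread waits only on its own stream instead of the whole device
cudaStreamSynchronize(stream[0]);   // in Thread1
cudaStreamSynchronize(stream[1]);   // in Thread2

// If VecSub in stream[1] must not start before VecAdd in stream[0] has
// finished, record an event in stream[0] and make stream[1] wait on it.
// The wait happens on the GPU; neither CPU thread blocks.
cudaEvent_t vecAddDone;
cudaEventCreateWithFlags(&vecAddDone, cudaEventDisableTiming);

VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream[0]>>>(d_A, d_B, d_C, N);
cudaEventRecord(vecAddDone, stream[0]);

cudaStreamWaitEvent(stream[1], vecAddDone, 0);
VecSub<<<blocksPerGrid, threadsPerBlock, 0, stream[1]>>>(d_A, d_B, d_C, N);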
CUPTI will only disable concurrent kernels if events are collected using CUPTI_EVENT_COLLECTION_MODE_KERNEL. Enabling profiling (whether via environment variables, the Visual Profiler, or this mode of CUPTI) disables concurrent kernel execution.
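The collection mode is set per context with cuptiSetEventCollectionMode(). A minimal sketch, assuming the current context is the one being profiled:
CUcontext ctx;
cuCtxGetCurrent(&ctx);

// Per-kernel counting gives exact per-launch values but serializes launches
cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);

// Continuous counting samples across kernel boundaries and avoids
// the per-kernel serialization
cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);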
Upvotes: 1