Reputation: 1
In my CUDA runtime program, why do the CPU and GPU compute asynchronously but not concurrently?
When I measure the time of the program, the total time is the sum of the CPU compute time and the GPU compute time. In the Visual Profiler I can see that the GPU does not start computing until the CPU has finished. My goal is to have the CPU and the GPU compute at the same time.
Platform:
Windows 10
CUDA 7.5
VS2013
Code compiled in debug mode (no optimization)
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include<time.h>
__global__ void addKernel()
{
int a ;
for (int i = 0; i < 10000;i++)
for (int j = 0; j < 10000;j++)
a = i;
}
void comput()
{
int a = 1;
for (int i = 0; i < 10000;i++)
for (int j = 0; j < 10000; j++)
{
for (int k = 0; k < 100;k++)
a = j;
}
}
int main()
{
cudaSetDevice(0);
cudaEvent_t start, stop1;
cudaEventCreate(&start);
cudaEventCreate(&stop1);
clock_t ss = clock();
cudaEventRecord(start,0);
addKernel<<<1,64>>>();
cudaEventRecord(stop1,0);
clock_t ct = clock();
comput();
clock_t ctt = clock();
cudaEventSynchronize(stop1);
cudaDeviceSynchronize();
clock_t sss = clock();
float t1;
cudaEventElapsedTime(&t1, start, stop1);
printf("clock GPU :%.4f s\n", t1/1000);
printf("clock cpu:%f s\n",(float) (ctt - ct)/CLOCKS_PER_SEC);
printf("clock total time: %f s\n", (float)(sss - ss) / CLOCKS_PER_SEC);
cudaEventDestroy(start);
cudaEventDestroy(stop1);
cudaDeviceReset();
}
Upvotes: 0
Views: 347
Reputation: 7245
There are a couple of issues (potentially) at play here:
If you are using the WDDM driver (as opposed to the TCC driver), kernel launches get batched up to reduce the effect of the WDDM driver's higher launch overhead. This means the driver will postpone the launch of addKernel(), waiting for more work, until it encounters the cudaEventSynchronize() call.¹⁾ By that time, however, comput() has already finished.
So in your example CPU and GPU work indeed do not run in parallel; instead, addKernel() on the GPU actually runs after comput() on the CPU.
You can prevent (further) batching and force an immediate launch of addKernel() by inserting a call to cudaStreamQuery(0) before calling comput(), as in the sketch below.
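A minimal sketch of how the relevant part of your main() could look with only that one call added (everything else is taken from your code):

cudaEventRecord(start, 0);
addKernel<<<1, 64>>>();
cudaEventRecord(stop1, 0);
cudaStreamQuery(0);      // flush the WDDM batch queue so addKernel() is launched now
clock_t ct = clock();
comput();                // CPU work can now overlap with the kernel running on the GPU
clock_t ctt = clock();
cudaEventSynchronize(stop1);
cudaDeviceSynchronize();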
addKernel() and comput() have no externally visible effect (they only set the local variable a) and may be completely optimised away by the compiler. Compiling in debug mode might not prevent all of these optimizations. This would make it harder to demonstrate asynchronous execution, as you would only be measuring kernel launch and timing overhead.
So replace them with code that performs real work, such as summing a vector, and store the result to a global variable; a minimal sketch follows.
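For example, something along these lines (an illustrative sketch, not your code; N, d_result, h_result and the input buffer are made up for this example, and the device buffer would have to be allocated with cudaMalloc() and filled before the launch):

#define N (1 << 20)

__device__ float d_result;      // result visible outside the kernel
volatile float h_result;        // result visible outside the host function

__global__ void sumKernel(const float *data)
{
    // naive single-threaded sum: real work that cannot be optimised away
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += data[i];
        d_result = s;           // store the result so the loop is not removed
    }
}

void sumHost(const float *data)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += data[i];
    h_result = s;               // store the result so the loop is not removed
}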
As halfelf has pointed out in his answer, the profiler may launch kernels synchronously under certain conditions.
¹⁾ If no further work comes up for a while, the wait might also time out and addKernel() may get launched before the cudaEventSynchronize() call.
Upvotes: 2
Reputation: 10097
From the CUDA programming guide:
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
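"Page-locked" host memory in the last sentence means memory allocated with cudaMallocHost() (or cudaHostAlloc()) rather than plain malloc(). A minimal illustrative sketch (the buffer names and size are made up here):

float *h_buf, *d_buf;
const size_t bytes = 1 << 20;

cudaMallocHost((void**)&h_buf, bytes);   // page-locked (pinned) host memory
cudaMalloc((void**)&d_buf, bytes);

// with pinned host memory this copy can really run asynchronously
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0);

// ... CPU work here could overlap with the copy ...

cudaStreamSynchronize(0);
cudaFreeHost(h_buf);
cudaFree(d_buf);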
Also, if compiled without optimization, the host function will run for a very long time, perhaps a million times as long as the kernel. With optimization, it effectively does nothing and returns immediately.
I suggest running your binary with CUDA_LAUNCH_BLOCKING=1 and CUDA_LAUNCH_BLOCKING=0 respectively and comparing the running times. Also modify your kernel and host function into something meaningful.
Upvotes: 1