Reputation: 1351
Code:
#include <cstdio>

// K1 runs a nested busy loop before printing.
__global__ void K1() {
    int p = 1;
    for (int i = 0; i < 100000; ++i)
        for (int j = 0; j < 100000; ++j)
            p *= (i + 100) * j;
    printf("K1\n");
}

// K2 only prints.
__global__ void K2() {
    printf("K2\n");
}

int main() {
    int *ptr;
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    K1<<<1, 1, 0, s1>>>();
    cudaHostAlloc(&ptr, 1, 0);  // page-locked host allocation issued between the two launches
    K2<<<1, 1, 0, s2>>>();
    cudaDeviceSynchronize();
    return 0;
}
Output:
K2
K1
Questions:
1. Is s1 the same as the default stream?
2. The CUDA documentation says:
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
- a page-locked host memory allocation,
Given that, shouldn't K2 start after K1 finishes?
Upvotes: 1
Views: 254
Reputation: 2926
On the first point, s1 is not the default stream.
On the second point, running your code on a small system, I captured a profiler timeline.
My understanding is that there is a delay between the host scheduling a kernel launch and the kernel actually starting on the device. This is no surprise, since kernel launches into streams are asynchronous. As a result, the cudaHostAlloc call occurs in between the two launch calls on the host, BUT before either kernel has actually started running, which would explain why K2 is not forced to wait for K1.
Upvotes: 1