Reputation: 275
I have the following simple piece of code:
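#include <stdio.h>

__global__ void loop()
{
    // Global thread index across all blocks
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    printf("This is iteration number %d\n", i);
}

int main()
{
    int N = 10;
    loop<<<N, N>>>();         // N blocks of N threads each
    cudaDeviceSynchronize();  // wait for the kernel so the device printf output is flushed
}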
When running it, I got the following:
This is iteration number 20
This is iteration number 21
This is iteration number 22
This is iteration number 23
This is iteration number 24
This is iteration number 25
This is iteration number 26
This is iteration number 27
This is iteration number 28
This is iteration number 29
This is iteration number 70
This is iteration number 71
This is iteration number 72
This is iteration number 73
This is iteration number 74
This is iteration number 75
This is iteration number 76
This is iteration number 77
This is iteration number 78
This is iteration number 79
This is iteration number 0
This is iteration number 1
This is iteration number 2
This is iteration number 3
This is iteration number 4
This is iteration number 5
This is iteration number 6
This is iteration number 7
This is iteration number 8
This is iteration number 9
This is iteration number 50
This is iteration number 51
This is iteration number 52
This is iteration number 53
This is iteration number 54
This is iteration number 55
This is iteration number 56
This is iteration number 57
This is iteration number 58
This is iteration number 59
This is iteration number 10
This is iteration number 11
This is iteration number 12
This is iteration number 13
This is iteration number 14
This is iteration number 15
This is iteration number 16
This is iteration number 17
This is iteration number 18
This is iteration number 19
This is iteration number 60
This is iteration number 61
This is iteration number 62
This is iteration number 63
This is iteration number 64
This is iteration number 65
This is iteration number 66
This is iteration number 67
This is iteration number 68
This is iteration number 69
This is iteration number 30
This is iteration number 31
This is iteration number 32
This is iteration number 33
This is iteration number 34
This is iteration number 35
This is iteration number 36
This is iteration number 37
This is iteration number 38
This is iteration number 39
This is iteration number 80
This is iteration number 81
This is iteration number 82
This is iteration number 83
This is iteration number 84
This is iteration number 85
This is iteration number 86
This is iteration number 87
This is iteration number 88
This is iteration number 89
This is iteration number 40
This is iteration number 41
This is iteration number 42
This is iteration number 43
This is iteration number 44
This is iteration number 45
This is iteration number 46
This is iteration number 47
This is iteration number 48
This is iteration number 49
This is iteration number 90
This is iteration number 91
This is iteration number 92
This is iteration number 93
This is iteration number 94
This is iteration number 95
This is iteration number 96
This is iteration number 97
This is iteration number 98
This is iteration number 99
As you can see, it seems that threads in the same block execute sequentially (in order). For example, the numbers 0 to 9 can appear anywhere in the output, but among them 0 always appears first, then 1, then 2, and so on. Is this a coincidence (I ran it multiple times and got the same result), or do threads in the same block always execute sequentially?
Upvotes: 0
Views: 206
Reputation: 152173
CUDA threads can execute in any order (unless you explicitly control order). That is the mental model that all CUDA programmers should have, and trying to base programming behavior on other observations or principles may be risky.
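One way to make the output independent of execution order is to have each thread write to its own slot and let the host print in index order. The following is only a minimal sketch of that idea, not something from the question: the names record_index and run_record_index are hypothetical, and it assumes a device that supports managed memory (cudaMallocManaged).

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread records its global index in its own slot, so the result does
// not depend on the order in which threads happen to run.
__global__ void record_index(int *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        out[i] = i;
    }
}

// Hypothetical helper: launches the kernel, prints the slots in index order
// on the host, and returns true if every slot holds its own index.
bool run_record_index(int blocks, int threads)
{
    int n = blocks * threads;
    int *out = NULL;
    if (cudaMallocManaged(&out, n * sizeof(int)) != cudaSuccess) return false;

    record_index<<<blocks, threads>>>(out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        printf("This is iteration number %d\n", out[i]);  // host decides the order
        ok = (out[i] == i);
    }
    cudaFree(out);
    return ok;
}

int main()
{
    return run_record_index(10, 10) ? 0 : 1;
}
```

Here the ordering comes from the host loop, so it holds no matter how the device schedules the threads.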
Within the device, currently, there is the notion of the warp, which represents a group of threads executing in lockstep. Currently the warp size is 32, so the 10 threads in each of your blocks all belong to the same warp and are therefore executing in lockstep. In other words, the printf calls within the same block all come from the same warp in your case.
Therefore when any thread in the warp is calling the printf routine, they all are. This simultaneous activity has to manifest itself in your output somehow, and you are simply observing that the resulting pattern is the same in each case.
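To check the warp membership claim for a given launch configuration, a rough sketch like the one below can record which warp of its block each thread falls into. The names record_warp and count_threads_outside_warp0 are hypothetical, the sketch assumes a one-dimensional block (where the warp index within the block is threadIdx.x / warpSize) and a device with managed memory support.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Records, for each thread, the index of its warp within its block.
__global__ void record_warp(int *warp_of, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        warp_of[i] = threadIdx.x / warpSize;  // warpSize is a built-in device variable
    }
}

// Hypothetical helper: returns how many threads are NOT in warp 0 of their
// block, or -1 if a CUDA call fails.
int count_threads_outside_warp0(int blocks, int threads)
{
    int n = blocks * threads;
    int *warp_of = NULL;
    if (cudaMallocManaged(&warp_of, n * sizeof(int)) != cudaSuccess) return -1;

    record_warp<<<blocks, threads>>>(warp_of, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(warp_of);
        return -1;
    }

    int outside = 0;
    for (int i = 0; i < n; ++i) {
        if (warp_of[i] != 0) ++outside;
    }
    cudaFree(warp_of);
    return outside;
}

int main()
{
    // Same configuration as the question: 10 blocks of 10 threads.
    int outside = count_threads_outside_warp0(10, 10);
    printf("threads outside warp 0 of their block: %d\n", outside);
    return outside == 0 ? 0 : 1;
}
```

With 10 threads per block and a warp size of 32, every thread lands in warp 0 of its block, so the count should be 0. Larger blocks would span several warps.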
That doesn't mean serialization of execution is going on in general, but there may be some in how printf handles simultaneous calls from the threads in a warp. Given that all the printf output temporarily goes into the same buffer, it seems logical that this funnelling of output from 10 threads in a warp into a single buffer follows some pattern, and "random" doesn't seem like a logical or likely choice for that pattern. At least we can say the behavior seems consistent, which is unsurprising.
Since the details of the device printf implementation are largely undocumented, I think it may not be very reliable to use printf to discover device behavioral details.
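If the goal is to observe ordering without going through printf, one possible alternative is to have each thread draw a ticket from a global atomic counter and inspect the tickets on the host. This is only a sketch under assumptions: take_ticket and run_take_ticket are hypothetical names, managed memory is assumed to be available, and the tickets only show the order in which the atomics were serviced on that particular run, not any guaranteed scheduling order.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread draws a unique ticket from a shared counter. atomicAdd returns
// the counter value from before the increment.
__global__ void take_ticket(unsigned int *counter, unsigned int *ticket, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        ticket[i] = atomicAdd(counter, 1u);
    }
}

// Hypothetical helper: returns true if all n tickets were handed out and
// every ticket lies in the range [0, n).
bool run_take_ticket(int blocks, int threads)
{
    int n = blocks * threads;
    unsigned int *counter = NULL;
    unsigned int *ticket = NULL;
    if (cudaMallocManaged(&counter, sizeof(unsigned int)) != cudaSuccess) return false;
    if (cudaMallocManaged(&ticket, n * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(counter);
        return false;
    }
    *counter = 0;

    take_ticket<<<blocks, threads>>>(counter, ticket, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess) &&
              (*counter == (unsigned int)n);

    for (int i = 0; ok && i < n; ++i) {
        printf("thread %d drew ticket %u\n", i, ticket[i]);
        ok = (ticket[i] < (unsigned int)n);
    }
    cudaFree(ticket);
    cudaFree(counter);
    return ok;
}

int main()
{
    return run_take_ticket(10, 10) ? 0 : 1;
}
```

The ticket values may differ between runs and devices, which is consistent with treating thread execution order as unspecified.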
Upvotes: 4