Reputation: 53
I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer using a _shared _ variable so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, so should be the memory in order to be accessed in a single read. However, I need to have an extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have an overlapping areas in which a copy of its neighbours is temporarily available.
Coeleased memory access:
Imagine the threads of a block are accessing the local memory block
1 int x = threadIdx.x;
2
3 for (int y = 0; y < height; y++)
4 {
5 double value_centre = array[y*width + x+1]; // remeber we have the border so we need an offset of + 1
6 double value_left = array[y*width + x ]; // hence the left element is at x
7 double value_right = array[y*width + x+2]; // and the right element at x+2
8
9 // .. do something
10 }
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned? Note also, if that is not the case then I would have unaligned access to the array for each row after the first one, since the width of my array is (32*x)+2, and hence not 32-byte aligned. A further padding would however solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coeleased access and only for the left element in the array, since that is the only one which is accessed without any offset?
Thread executed in a warp:
Threads in a warp are only executed in parallel if and only if all the instructions are the same (according to link). If I do have a conditional statement / diverging execution, then that particular thread will be executed by itself and not within a warp with the others.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7 array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and executed in parallel until line 3 and then stop for all other threads and only the first and last thread further executed to initialise the border, or will those be separated from all other threads already at the beginning, since there is an if statement that all other threads do not fulfill?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Need this to be valid for the whole function? This is not the case for thread 1 (x=0), since it initialises also the border and therefore is different from others. To my understanding, thread 1 is executed in a single warp, thread (2-33, etc.) in another warp, which then doesn't access the memory in a singe read, due to miss-alignment, and then again the final thread in a single warp due to the other border. Is that correct?
I wonder what the best practice is, to have either memory perfectly aligned for each row (in which case I would run each block with (32*x-2) threads so that the array with border is (32*x-2)+2 a multiple of 32 for each new line) or do it the way I had demonstrated above, with threads a multiple of 32 for each block and just live with the unaligned memory. I am aware that these sort of questions are not always straightforward and often depend on particular cases, but sometimes certain things are a bad practice and should not become habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. I got however a warning that my occupancy level is at 17%, which I think must be really low and therefore there is something I do wrong. I didn't manage to find information on how threads are executed in parallel and how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Upvotes: 3
Views: 1215
Reputation: 151799
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coeleased access and only for the left element in the array, since that is the only one which is accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Need this to be valid for the whole function? This is not the case for thread 1 (x=0), since it initialises also the border and therefore is different from others. To my understanding, thread 1 is executed in a single warp, thread (2-33, etc.) in another warp, which then doesn't access the memory in a singe read, due to miss-alignment, and then again the final thread in a single warp due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.
Upvotes: 2