Alexander Dzhoganov

Reputation: 236

How to optimize Conway's game of life for CUDA?

I've written this CUDA kernel for Conway's game of life:

__global__ void gameOfLife(float* returnBuffer, int width, int height) {  
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;  
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;  
    float p = tex2D(inputTex, x, y);  
    float neighbors = 0;  
    neighbors += tex2D(inputTex, x+1, y);  
    neighbors += tex2D(inputTex, x-1, y);  
    neighbors += tex2D(inputTex, x, y+1);  
    neighbors += tex2D(inputTex, x, y-1);  
    neighbors += tex2D(inputTex, x+1, y+1);  
    neighbors += tex2D(inputTex, x-1, y-1);  
    neighbors += tex2D(inputTex, x-1, y+1);  
    neighbors += tex2D(inputTex, x+1, y-1);  
    __syncthreads();  
    float final = 0;  
    if(neighbors < 2) final = 0;  
    else if(neighbors > 3) final = 0;  
    else if(p != 0) final = 1;  
    else if(neighbors == 3) final = 1;  
    __syncthreads();  
    returnBuffer[x + y*width] = final;  
}

I am looking for errors and optimizations. Parallel programming is quite new to me and I am not sure I am going about it the right way.

The rest is a memcpy from an input array to the 2D texture inputTex bound to a CUDA array. Output is memcpy-ed from global memory to host and then dealt with.
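
Roughly, the setup looks like this (a sketch using the texture reference API; names and error checking are approximate):

// Host-side sketch: copy the input into a CUDA array bound to inputTex,
// run one step, copy the result back.
texture<float, 2, cudaReadModeElementType> inputTex;

void runStep(const float* hostInput, float* hostOutput, int width, int height) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray* inputArray;
    cudaMallocArray(&inputArray, &desc, width, height);
    cudaMemcpy2DToArray(inputArray, 0, 0, hostInput, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(inputTex, inputArray, desc);

    float* returnBuffer;
    cudaMalloc(&returnBuffer, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid(width / 16, height / 16);   // assuming width and height are multiples of 16
    gameOfLife<<<grid, block>>>(returnBuffer, width, height);

    cudaMemcpy(hostOutput, returnBuffer, width * height * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaUnbindTexture(inputTex);
    cudaFreeArray(inputArray);
    cudaFree(returnBuffer);
}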

As you can see, each thread deals with a single pixel. I am unsure whether that is the fastest approach, as some sources suggest doing a row or more per thread. If I understand correctly, NVIDIA themselves say the more threads, the better. I would love advice on this from someone with practical experience.

Upvotes: 12

Views: 4544

Answers (4)

huseyin tugrul buyukisik

Reputation: 11916

The following optimization gets ~140 GB/s output throughput (1 char per cell) on an RTX 4070:

// Computing 4 outputs at once: each mask selects the 3x3 window (including the
// center cell) around one of the four inner cells of a 4x4 block packed into
// the 16-bit 'key' (bit 15 = row 0, column 0).
// A bit faster than a 4x4 lookup table but a lot less readable.
constexpr unsigned short topLeftMask  = 0b1110111011100000;
constexpr unsigned short topRightMask = 0b0111011101110000;
constexpr unsigned short botLeftMask  = 0b0000111011101110;
constexpr unsigned short botRightMask = 0b0000011101110111;
constexpr unsigned char  topLeftPos  = 0b00001000;
constexpr unsigned char  topRightPos = 0b00000100;
constexpr unsigned char  botLeftPos  = 0b00000010;
constexpr unsigned char  botRightPos = 0b00000001;
// 3x3 population counts, each including the center cell itself.
int ct1 = __popc(topLeftMask & key);
int ct2 = __popc(topRightMask & key);
int ct3 = __popc(botLeftMask & key);
int ct4 = __popc(botRightMask & key);
// The center cells of the four windows (bits 10, 9, 6 and 5 of the key).
int s1 = (key >> 10) & 1;
int s2 = (key >> 9) & 1;
int s3 = (key >> 6) & 1;
int s4 = (key >> 5) & 1;
// With the center included in the count, a cell is alive next generation iff
// the 3x3 sum is 3, or the sum is 4 and the cell itself is alive.
unsigned char result1 = (ct1 == 3 || (ct1 == 4 && s1)) * topLeftPos;
unsigned char result2 = (ct2 == 3 || (ct2 == 4 && s2)) * topRightPos;
unsigned char result3 = (ct3 == 3 || (ct3 == 4 && s3)) * botLeftPos;
unsigned char result4 = (ct4 == 3 || (ct4 == 4 && s4)) * botRightPos;
const unsigned char result = result1 | result2 | result3 | result4;

This computes cells using only mask values and bit counting (more efficient than counting each neighbor separately). It requires the input as a 16-bit representation of a 4x4 region and produces a 4-bit representation of the resulting 2x2 region.
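
For illustration, one way the 16-bit key could be assembled from a row-bit-packed grid (a sketch; the grid layout, the name rows, and the bit ordering are assumptions, not part of the code above):

// Sketch: build the 16-bit key for the 4x4 block whose top-left corner is at
// (r, c). Assumes a grid 32 cells wide, packed one row per 32-bit word with
// column 0 in bit 31, and 0 <= c <= 28 (no wrap-around handling).
__device__ unsigned buildKey(const unsigned* rows, int r, int c) {
    unsigned key = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned nibble = (rows[r + i] >> (28 - c)) & 0xF; // columns c..c+3, column c in bit 3
        key = (key << 4) | nibble;                         // row r ends up in bits 15..12
    }
    return key;
}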

There is no warp divergence, no memory access other than the inputs, and it uses integer units instead of floating-point units. An RTX 5000-series card should therefore have about 2x the per-core performance, perhaps 600-700 GB/s on an RTX 5080 (a guesstimate).

A warp-shuffle version with a tile size of 8x3 might get by with fewer instructions.

Upvotes: -1

Johan

Reputation: 76703

TL;DR: see: http://golly.sourceforge.net

The problem is that most CUDA implementations follow the brain-dead idea of manually counting the neighbors. This is so slow that any smart serial CPU implementation will outperform it.

The only sensible way to do GoL calculations is using a lookup table.
The currently fastest CPU implementations look up a square 4x4 = 16-bit block to get the future 2x2 cells inside it.

In this setup the cells are laid out like so:

 01234567
0xxxxxxxx //byte0
1xxxxxxxx //byte1 
2  etc
3
4
5
6
7

Some bit-shifting is employed to get a 4x4 block to fit into a word, and that word is looked up in a lookup table. The lookup table holds words as well; this way 4 different versions of the outcome can be stored in it, so you can minimize the amount of bit-shifting that has to be done on the input and/or the output.
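
A minimal sketch of how such a table could be built on the host (the bit layout here, bit 15 = row 0/col 0 of the block and a single 4-bit result per entry, is an assumption; the real tables store pre-shifted copies of the result as described above):

#include <cstdint>

// Sketch: precompute the next generation of the inner 2x2 cells for every
// possible 4x4 block (16 bits -> 65536 entries).
static uint8_t lut[1 << 16];

static int cell(unsigned block, int r, int c) {        // bit 15 = row 0, col 0
    return (block >> (15 - (r * 4 + c))) & 1;
}

static void buildLut() {
    for (unsigned block = 0; block < (1u << 16); ++block) {
        unsigned out = 0;
        for (int r = 1; r <= 2; ++r)                   // the inner 2x2 cells
            for (int c = 1; c <= 2; ++c) {
                int n = 0;
                for (int dr = -1; dr <= 1; ++dr)       // count the 8 neighbors
                    for (int dc = -1; dc <= 1; ++dc)
                        if (dr || dc) n += cell(block, r + dr, c + dc);
                int alive = cell(block, r, c);
                out = (out << 1) | (n == 3 || (alive && n == 2));
            }
        lut[block] = (uint8_t)out;                     // bit 3 = (1,1) ... bit 0 = (2,2)
    }
}

A stepping routine then only has to shift each 4x4 window into a word and do a single lut[] read instead of counting neighbors.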

In addition the different generations are staggered, so that you only have to look at 4 neighboring slabs, instead of 9. Like so:

AAAAAAAA 
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
           BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixel below and to the left of A.

This alone results in a 1000x+ speed-up compared to silly counting implementations.

Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.

And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time of normal implementations. This allows you to calculate generation 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately, HashLife requires recursion, and thus is difficult on CUDA.

Upvotes: 4

Eri

Reputation: 31

Have a look at this thread; we made a lot of improvements over there:

http://forums.nvidia.com/index.php?showtopic=152757&st=60

Upvotes: 3

Eugene Smith

Reputation: 9398

My two cents.

The whole thing looks likely to be bounded by the latency of communication between multiprocessors and the GPU memory. You have code that should take something like 30-50 clock ticks to execute on its own, and it generates at least 3 memory accesses which take 200+ clock ticks each if the requisite data is not in the cache.

Using texture memory is a good way to address that, but it is not necessarily the optimal way.

At the very least, try to do 4 pixels at a time (horizontally) per thread. Global memory can be accessed 128 bytes at a time (as long as you have a warp trying to access any byte in a 128-byte interval, you might as well pull in the whole cache line at almost no additional cost). Since a warp is 32 threads, having each thread work on 4 pixels should be efficient.
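
A minimal sketch of that idea (one byte per cell in plain global memory; interior cells only, and the names, layout and the requirement that width be a multiple of 4 are assumptions):

// Sketch: each thread computes 4 horizontally adjacent cells and writes them
// with one 4-byte store, so a 32-thread warp covers a full 128-byte segment.
// Border cells are left for separate handling.
__global__ void lifeStep4(const unsigned char* in, unsigned char* out,
                          int width, int height) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // first of this thread's 4 cells
    int y  = blockIdx.y * blockDim.y + threadIdx.y;
    if (x0 == 0 || x0 + 4 >= width || y == 0 || y >= height - 1) return;

    unsigned char cell[4];
    for (int i = 0; i < 4; ++i) {
        int x = x0 + i, n = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx || dy) n += in[(y + dy) * width + (x + dx)];
        cell[i] = (n == 3) || (in[y * width + x] && n == 2);
    }
    // One vectorized store per thread; x0 is a multiple of 4 and rows are
    // assumed 4-byte aligned, so the address is aligned for uchar4.
    *reinterpret_cast<uchar4*>(&out[y * width + x0]) =
        make_uchar4(cell[0], cell[1], cell[2], cell[3]);
}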

Furthermore, you want to have vertically-adjacent pixels worked on by the same multiprocessor. The reason is that adjacent rows share the same input data. If you have the pixel (x=0,y=0) worked on by one MP and the pixel (x=0,y=1) is worked on by a different MP, both MPs have to issue three global memory requests each. If they are both worked on by the same MP and the results are properly cached (implicitly or explicitly), you only need a total of four. This can be done by having each thread work on several vertical pixels, or by having blockDim.y>1.

More generally, you'd probably want each 32-thread warp to load as much memory as you have available on the MP (16-48 KB, or at least a 128x128 block), and then process all pixels within that window.

On devices of compute capability before 2.0, you'll want to use shared memory. On devices of compute capability 2.0 and 2.1, caching capabilities are much improved, so global memory may be fine.
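
For the pre-2.0 case, a shared-memory version might look roughly like this (a sketch: one cell per thread staged through a tile with a one-cell halo; block dimensions and names are assumptions):

#define BLOCK_X 32
#define BLOCK_Y 8

// Sketch: stage a (BLOCK_Y+2) x (BLOCK_X+2) tile (the block's cells plus a
// one-cell halo) in shared memory, then compute one output cell per thread.
// Out-of-range neighbors are treated as dead.
__global__ void lifeStepShared(const unsigned char* in, unsigned char* out,
                               int width, int height) {
    __shared__ unsigned char tile[BLOCK_Y + 2][BLOCK_X + 2];

    // Cooperatively fill the tile, including the halo (threads near the block
    // edge load more than one element).
    for (int ty = threadIdx.y; ty < BLOCK_Y + 2; ty += BLOCK_Y)
        for (int tx = threadIdx.x; tx < BLOCK_X + 2; tx += BLOCK_X) {
            int gx = int(blockIdx.x) * BLOCK_X + tx - 1;
            int gy = int(blockIdx.y) * BLOCK_Y + ty - 1;
            tile[ty][tx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                               ? in[gy * width + gx] : 0;
        }
    __syncthreads();   // the tile must be complete before anyone reads it

    int x = blockIdx.x * BLOCK_X + threadIdx.x;
    int y = blockIdx.y * BLOCK_Y + threadIdx.y;
    if (x >= width || y >= height) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;    // this thread's cell in the tile
    int n = tile[ty - 1][tx - 1] + tile[ty - 1][tx] + tile[ty - 1][tx + 1]
          + tile[ty    ][tx - 1]                    + tile[ty    ][tx + 1]
          + tile[ty + 1][tx - 1] + tile[ty + 1][tx] + tile[ty + 1][tx + 1];
    out[y * width + x] = (n == 3) || (tile[ty][tx] && n == 2);
}

The launch has to use a matching block shape, e.g. lifeStepShared<<<grid, dim3(BLOCK_X, BLOCK_Y)>>>(in, out, width, height); larger tiles per block trade occupancy against the data reuse described above.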

Some nontrivial savings could be had by making sure that each warp only accesses two cache lines in each horizontal row of input pixels, instead of three, as would happen in a naive implementation that works on 4 pixels per thread, 32 threads per warp.

There's no good reason to use float as the buffer type. Not only do you use four times the memory bandwidth, but the code becomes unreliable and bug-prone. (For example, are you sure that if(neighbors == 3) works correctly, since you're comparing a float and an integer?) Use unsigned char. Better yet, use uint8_t and typedef it to mean unsigned char if it's not defined.

Finally, don't underestimate the value of experimenting. Quite often performance of CUDA code can't be easily explained by logic and you have to resort to tweaking parameters and seeing what happens.

Upvotes: 12
