theQman

Reputation: 1780

How to take advantage of parallelism when implementing elementary cellular automata in python?

This is more an exercise in understanding the appropriate way to implement parallelism. I don't necessarily need an optimal program for running elementary cellular automata in particular.

Let's say we have N=10000 cells laid out in 1d. The ECA works by applying a local rule f at each cell i, given its neighbour states and its own state:

c[t][i] = f(c[t-1][i-1], c[t-1][i], c[t-1][i+1])
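For concreteness, one way f could look in Python, building the lookup from a Wolfram rule number (a sketch; rule 110 is chosen arbitrarily here, and the names are mine):

```python
# Sketch: the local rule f derived from a Wolfram rule number.
# Rule 110 is used purely as an example.
RULE = 110

def f(left, center, right):
    # The three neighbour states form a 3-bit index 0..7; the
    # corresponding bit of the rule number is the new state.
    idx = (left << 2) | (center << 1) | right
    return (RULE >> idx) & 1
```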

The brute force way to evolve this for 9999 timesteps would be something like:

import numpy as np

N = 10000
T = 10000  # c[0] plus 9999 evolved rows
c = np.zeros((T, N), dtype=np.uint8)
c[0] = np.random.randint(2, size=N)
temp = np.zeros(N, dtype=np.uint8)
for t in range(1, T):
    for i in range(N):
        temp[i] = f(c[t-1][(i-1) % N], c[t-1][i], c[t-1][(i+1) % N])
    c[t] = temp.copy()

give or take some variable initializations.

Is there a way to parallelize the N=10000 applications at each time step? Should this be done using threads? Or are those local rules too fast and simple to really take advantage of anything? I have very little knowledge of taking advantage of parallelism with code, but this seems like a natural setting for it.

Upvotes: 1

Views: 591

Answers (1)

Mo0nbase

Reputation: 123

The Issue At Hand

I've recently been tooling around with this problem for a while, looking at the feasibility of a multi-threaded implementation of ECAs. Based on some basic intuition about multi-threading and the nature of elementary cellular automata, it is highly unlikely that multi-threading will help the program at all; in most cases it will simply slow the simulation down, especially in a relatively un-optimized language like Python.

The key issue is that multi-threading requires a problem to be capable of a distributed workload, which a generic cellular automaton theoretically is. However, a 1D cellular automaton depends on the data of the previous generation (i-1) to derive the current generation (i). As a result, if we dedicate, say, n threads to an array of 1's and 0's of length n*256, each thread can compute the rule 258 times per generation (+2 because of the boundaries of each division). At first this may seem like a perfectly plausible solution, and it is; however, if the goal is speed/efficiency, it will perform considerably worse, to varying degrees depending on the language at hand.
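To make the division of work concrete, here is a sketch of that chunked scheme in Python (the chunk size, function names, and the use of a thread pool are all illustrative; each worker computes one 256-cell slice per generation, reading two extra boundary cells from the neighbouring slices):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

CHUNK = 256

def rule110(left, center, right):
    # Example rule: bit lookup on the Wolfram rule number 110.
    idx = (left << 2) | (center << 1) | right
    return (110 >> idx) & 1

def step_chunk(prev, start):
    # prev is the full previous generation; workers only read it,
    # so there are no write conflicts within one step.
    n = len(prev)
    out = np.empty(CHUNK, dtype=prev.dtype)
    for k in range(CHUNK):
        i = start + k
        out[k] = rule110(prev[(i - 1) % n], prev[i], prev[(i + 1) % n])
    return out

def step_parallel(prev, pool):
    # One task per 256-cell slice; concatenate the slices back together.
    starts = range(0, len(prev), CHUNK)
    chunks = pool.map(step_chunk, [prev] * len(starts), starts)
    return np.concatenate(list(chunks))
```

Note that in CPython the GIL prevents these threads from running the pure-Python inner loop in parallel anyway, which only strengthens the point below.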

Why Not Multithreading?

The computation of 258 comparisons is extremely fast on a modern processor (whether you use a bit-wise comparison or do it programmatically), while the computational cost of spinning up threads is comparatively large. Even with a fixed-size cellular automaton, where you can explicitly define the bounds of each thread on whatever data structure you choose to work with, the performance that could be gained by parallelizing such a simple calculation is just not there.

In addition to the cost of spinning up threads, unless you are programming in a language built for concurrency (e.g. Go), with built-in tools for fast thread initialization and for handling memory-access collisions (which you would have to handle, since each new cell depends on its closest neighbours and the slices overlap), there are several caveats you would have to write into your loops: conflict handling (mostly in memory) and potential system failures (such as a thread being locked out by some system process, throwing off how much of the array each thread must compare). All of these edge cases will slow your program down significantly, especially in Python. In my experience, unless your cellular automaton's initial condition or immediate evolution space is on the order of n*500000 cells or more per evolution (the point where a single thread takes roughly 10 seconds per 500,000 cells), there will be no performance gain from multi-threading, and most likely a degradation.

Computational Irreducibility

In other words, each thread can only perform calculations on a section of one evolution at a time; we cannot use threads to compute a cellular automaton vertically (across generations) rather than horizontally. This is core to Wolfram's idea of computational irreducibility: the entire automaton must be computed sequentially, in order, with all immediately previous data accessible to the program. In some cases where a rule can be generalized to a boolean expression (reference), vertical computation could be possible because the outcome is predictable; in the general case it is not, because of rules like 30, 110, and 54, which display computationally irreducible behavior.

What Can We Do?

With this said, there are ways to pseudo-parallelize a cellular automaton, with or without boolean expressions (described below). One of them, described in Wolfram's A New Kind of Science (reference), takes advantage of a CPU's internal register size to parallelize comparisons; the method described in the book is known as "bit-slicing".

You may frequently hear computers referred to as 64-bit or 32-bit systems; this refers to the size of the internal registers. Say we have a 64-bit computer: instead of sequentially comparing each 1 and 0 (using only 1/64th of the register), we can process units of 64 bits (8 bytes) at a time, giving a theoretical up-to-64x throughput gain on a single core over bit-at-a-time processing. We might do this by arranging, say on an 8-bit computer, a 1D cellular automaton initial condition vertically as three 8-bit numbers.

1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1  1  1  0  1  1  0  1  1  0  1  1  0  1  1  0  0  1  1  0  1  1  1  0
     R1|      R2|      R3|      R4|      R5|      R6|      R7|      R8|

->

   C1 C2 C3 
 
R1| 1  1  1
R2| 0  1  1
R3| 0  1  1
R4| 0  1  1
R5| 0  1  1
R6| 0  0  1
R7| 1  0  1
R8| 1  1  0    

C1 = 131
C2 = 249
C3 = 254
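The packing above can be checked with a few lines of Python (a sketch; the function name and layout arguments are mine):

```python
# The 24 cells from the example, row-major: R1 = (1,1,1), R2 = (0,1,1), ...
cells = [1,1,1, 0,1,1, 0,1,1, 0,1,1, 0,1,1, 0,0,1, 1,0,1, 1,1,0]

def pack_columns(cells, rows=8, cols=3):
    # Read each column top-to-bottom as one 8-bit integer,
    # with the top row (R1) as the most significant bit.
    words = []
    for c in range(cols):
        w = 0
        for r in range(rows):
            w = (w << 1) | cells[r * cols + c]
        words.append(w)
    return words
```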

In this configuration, the left and right neighbours of each bit are the bits in the same row of the adjacent numbers. Using this layout, we can apply a rule's boolean expression with bit-wise operators on these numbers (shifting the first and last numbers at the boundaries) to evaluate the rule. For example, rule 110 can be described (as given on Wolfram Alpha) as...

(p, q, r) -> (q AND (NOT p)) OR (q XOR r)  

and in Go it looks like this, assuming a bit-sliced array with all the leading zeros pre-allocated in a 2D simulation. I'm not too familiar with Python's access to bare-metal instructions, but this can certainly be replicated in Python; it is very fast in a statically compiled language like Go.

func historicallyAware(evolutions int) {
    for i := 1; i < evolutions+1; i++ {
        // First word: the left neighbour wraps around from the last word.
        sim[i][0] = ((^(sim[i-1][len(sim[i-1])-1]) << 1) & sim[i-1][0]) | (sim[i-1][0] ^ sim[i-1][1])
        // Interior words: (q AND NOT p) OR (q XOR r) on whole words at once.
        for j := 1; j < len(sim[i])-1; j++ {
            sim[i][j] = ((^(sim[i-1][j-1])) & sim[i-1][j]) | (sim[i-1][j] ^ sim[i-1][j+1])
        }
        // Last word: the right neighbour wraps around to the first word.
        sim[i][len(sim[i])-1] = ((^(sim[i-1][len(sim[i])-2])) & sim[i-1][len(sim[i])-1]) | (sim[i-1][len(sim[i])-1] ^ sim[i-1][0]>>1)
    }
}
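For what it's worth, Python's arbitrary-precision integers let you pull the same trick without worrying about word boundaries at all: pack the whole row into one int and every bitwise operation updates all N cells at once (a sketch for rule 110 with a periodic boundary; the function name and bit layout are my own choices):

```python
def step_bitsliced(row, n):
    # row is an int whose bit i holds cell i (LSB = cell 0); n = row length.
    mask = (1 << n) - 1
    # p, q, r are the left neighbour, the cell itself, and the right
    # neighbour of every cell simultaneously (cyclic bit shifts).
    p = ((row << 1) | (row >> (n - 1))) & mask
    q = row
    r = ((row >> 1) | (row << (n - 1))) & mask
    # Rule 110 as a boolean expression: (q AND NOT p) OR (q XOR r).
    return ((q & ~p) | (q ^ r)) & mask
```

Each call advances one full generation; the per-cell work is done inside the interpreter's big-int machinery in C rather than in a Python loop.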

Upvotes: 2
