Reputation: 3870
I am simulating an infinite sequence of die rolls to calculate the average "hitting time" for a sequence. In this particular case I am looking for the first occurrence of "11" or "12". For example, in "34241213113..." the first occurrence of "12" is at time 6 and that of "11" is at time 10. Here is my Python code.
import numpy as np

NN = 1000000
t11 = np.zeros(NN)
t12 = np.zeros(NN)
for i in range(NN):
    prev = np.random.randint(1, 7)
    flag11 = True
    flag12 = True
    ctr = 2
    while flag11 or flag12:
        curr = np.random.randint(1, 7)
        if flag11 and prev == 1 and curr == 1:
            t11[i] = ctr
            flag11 = False
        if flag12 and prev == 1 and curr == 2:
            t12[i] = ctr
            flag12 = False
        ctr = ctr + 1
        prev = curr
print('Mean t11: %f' % (np.mean(t11)))
print('\nMean t12: %f' % (np.mean(t12)))
As soon as both sequences have been observed, we start a new sample path. It takes about a million sample paths before the expected values converge to the theoretical ones (42 for "11" and 36 for "12"), and the code takes about a minute to run. I am new to Python, and have been using it for just about a month.
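(For the record, those theoretical values come from the standard overlap argument: a pattern with no self-overlap such as "12" has expected hitting time 6^2 = 36, while "11" overlaps itself in one position and picks up an extra term, 6 + 6^2 = 42.)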
I was wondering if there is a way to speed up the code, maybe a different algorithm or some optimization of the routines? Would it have notably different performance in a compiled language vs. an interpreted language?
Upvotes: 0
Views: 101
Reputation: 47010
There is a good tool for this problem: a finite state machine.
But Python isn't a very good language for a fast implementation.
Here is a non-deterministic state machine that recognizes the two sequences in any stream of inputs, where * denotes throws of 3 through 6 (diagram omitted).
This is hard to implement directly because it can occupy more than one state at a time, but there is a standard algorithm called the subset construction that converts it to a deterministic state machine, which is very efficient to implement. Applying it here produces a machine with four states (diagram omitted), which can be read off the code below: A (no useful suffix yet), AB (the last roll was a 1), ABC (just completed "11"), and AD (just completed "12").
Here is a C implementation. In Python you would use a map taking a tuple of the current state plus the input number to the next state. Here we use goto to implement the map in terms of position in the executing code:
#include <stdio.h>
#include <stdlib.h>

#define ROLL do { r = 1 + rand() % 6; } while (0)

int main(void) {
    int n_trials = 10000000;
    int t_total_11 = 0;
    int t_total_12 = 0;
    for (int n = 0; n < n_trials; n++) {
        int r, t = -1, t_11 = 0, t_12 = 0;
    A:   /* no useful suffix yet */
        ++t;
        ROLL;
        if (r == 1) goto AB;
        goto A;
    AB:  /* last roll was a 1 */
        ++t;
        ROLL;
        if (r == 1) goto ABC;
        if (r == 2) goto AD;
        goto A;
    ABC: /* just completed "11"; last roll is still a 1 */
        ++t;
        if (!t_11) {
            t_11 = t;
            t_total_11 += t_11;
            if (t_12) continue; /* both patterns seen; next trial */
        }
        ROLL;
        if (r == 1) goto ABC;
        if (r == 2) goto AD;
        goto A;
    AD:  /* just completed "12" */
        ++t;
        if (!t_12) {
            t_12 = t;
            t_total_12 += t_12;
            if (t_11) continue; /* both patterns seen; next trial */
        }
        ROLL;
        if (r == 1) goto AB;
        goto A;
    }
    printf("Avg for 11: %lf\n", (double) t_total_11 / n_trials);
    printf("Avg for 12: %lf\n", (double) t_total_12 / n_trials);
    return 0;
}
On my old MacBook, this does 10 million (not 1 million) iterations in 5.3 seconds, so it's a cool ~100 times faster. Of course, a good bit depends on the speed of the PRNG. GNU's rand is fast but not very random; apparently it's good enough to illustrate convergence, though. The program prints:
Avg for 11: 41.986926
Avg for 12: 35.997196
When I have more time, I will try a Python implementation.
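In the meantime, here is a minimal sketch of the dictionary-driven approach described above. The transition table is read straight off the goto code; random.randint stands in for the C PRNG, and it has not been benchmarked:

import random

# Deterministic transitions: (state, roll) -> next state.
# Any pair not listed falls back to state A (rolls 3-6 never help).
TRANS = {
    ('A', 1): 'AB',
    ('AB', 1): 'ABC',   # just completed "11"; last roll is still a 1
    ('AB', 2): 'AD',    # just completed "12"
    ('ABC', 1): 'ABC',
    ('ABC', 2): 'AD',
    ('AD', 1): 'AB',
}

n_trials = 1000000
t_total_11 = t_total_12 = 0
for _ in range(n_trials):
    state = 'A'
    t = 0
    t_11 = t_12 = 0
    while not (t_11 and t_12):
        t += 1
        state = TRANS.get((state, random.randint(1, 6)), 'A')
        if state == 'ABC' and not t_11:
            t_11 = t
        elif state == 'AD' and not t_12:
            t_12 = t
    t_total_11 += t_11
    t_total_12 += t_12

print('Avg for 11: %f' % (t_total_11 / n_trials))
print('Avg for 12: %f' % (t_total_12 / n_trials))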
Upvotes: 1
Reputation: 1350
Here is a Cython implementation of your code snippet, which analyzes 10^6 die rolls in 0.7 seconds on my machine:
from libc.stdlib cimport rand
import numpy as np
cimport numpy as np

DTYPE = np.int64
ctypedef np.int64_t DTYPE_T

cdef int simple_minded_randint(int min_val, int max_val):
    """For demonstration purpose only! Does not generate a uniform distribution."""
    return min_val + rand() % max_val

def diceroll(numrolls):
    cdef long NN = numrolls
    cdef long i
    cdef DTYPE_T ctr, prev, curr
    cdef int flag11, flag12
    cdef np.ndarray[DTYPE_T, ndim=1] t11 = np.zeros(NN, dtype=DTYPE)
    cdef np.ndarray[DTYPE_T, ndim=1] t12 = np.zeros(NN, dtype=DTYPE)
    for i in range(NN):
        prev = simple_minded_randint(1, 6)
        flag11 = 1
        flag12 = 1
        ctr = 2
        while flag11 or flag12:
            curr = simple_minded_randint(1, 6)
            if flag11 and prev == 1 and curr == 1:
                t11[i] = ctr
                flag11 = 0
            if flag12 and prev == 1 and curr == 2:
                t12[i] = ctr
                flag12 = 0
            ctr = ctr + 1
            prev = curr
    print('Mean t11: %f' % (np.mean(t11)))
    print('Mean t12: %f' % (np.mean(t12)))
I added some static types and used the random generator from the C standard library, because calling np.random.randint() in a loop slows things down quite a bit. Note that this random generator is for demonstration purposes only, as it does not generate a uniform distribution; see this answer.
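If you haven't built a Cython module before, one low-friction way to compile and call this is pyximport (assuming the code above is saved as diceroll.pyx; the filename is just for illustration):

import numpy
import pyximport

# NumPy's headers are needed because the module does "cimport numpy".
pyximport.install(setup_args={'include_dirs': numpy.get_include()})

import diceroll            # compiles diceroll.pyx on first import
diceroll.diceroll(1000000)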
Upvotes: 0
Reputation: 155594
You can speed this up a lot by doing more work per call to numpy (blocks of random numbers instead of single values), and by simplifying your pattern search using Python's built-in bytes scanning:
import numpy as np

NN = 1000000
t11 = np.zeros(NN)
t12 = np.zeros(NN)
for i in range(NN):
    block = b'\xff'  # Prepopulate with a garbage byte so the first byte is never part of a match
    flag11 = flag12 = True
    ctr = 1  # One lower to account for the non-generated first byte
    while flag11 or flag12:
        # Generate 100 numbers at once, much faster than one at a time;
        # store as bytes for reduced memory and cheap searches.
        # Keep the last byte of the previous block so a 1 at the end
        # matches a 1/2 at the beginning of the next.
        block = block[-1:] + bytes(np.random.randint(1, 7, 100, np.uint8))
        # A containment test scans in C, faster than a Python-level one-at-a-time check
        if flag11 and b'\x01\x01' in block:
            t11[i] = ctr + block.index(b'\x01\x01')
            flag11 = False
        if flag12 and b'\x01\x02' in block:
            t12[i] = ctr + block.index(b'\x01\x02')
            flag12 = False
        ctr += 100
print('Mean t11: %f' % (np.mean(t11)))
print('\nMean t12: %f' % (np.mean(t12)))
On my (admittedly underpowered) machine, your original code took ~96 seconds to run; my optimized version took ~6.6 seconds, or about 7% of the original runtime. Even though (on average) over half the random numbers generated are never needed, it's still faster to generate them when doing so avoids more Python-level work to loop and try again.
Rewriting a bit further, you can avoid the double scan of block by changing:
if flag11 and b'\x01\x01' in block:
    t11[i] = ctr + block.index(b'\x01\x01')
    flag11 = False
to the somewhat more verbose, but more efficient:
if flag11:
    try:
        t11[i] = ctr + block.index(b'\x01\x01')
    except ValueError:
        pass
    else:
        flag11 = False
(and make the equivalent change to the flag12 test, shown below)
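For concreteness, the equivalent flag12 block is:

if flag12:
    try:
        t12[i] = ctr + block.index(b'\x01\x02')
    except ValueError:
        pass
    else:
        flag12 = False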
Since the first 100 bytes generated typically contain a hit anyway, this replaces two scans with one and reduces overall runtime to ~6.0 seconds. There are more extreme micro-optimizations available (that are more about knowing CPython internals than any logical improvement) that can get it down to ~5.4 seconds on my machine, but they're ugly and not worth the bother 99.9% of the time.
Upvotes: 1
Reputation: 1
Performance of programs in compiled languages is generally much better than in interpreted ones. That is why high-frequency trading systems, video game engines, and other highly demanding software are written in compiled languages such as C++.
In terms of optimizations, you could try tools that compile Python code (such as Cython), or run the program natively from the command line instead of inside an IDE.
Upvotes: 0