Reputation: 1153
Using Python, I'm running a simulation where a community of species goes through a sequential set of time steps ('scenes') in each of which extinction occurs. From an initial set of N species, each extinction needs to select a number of survivors, which then forms the pool to be subsampled at the next extinction. The number of survivors at each step is drawn randomly from a binomial distribution, given the community size and a per-species probability of survival.
The examples below show a single chain of steps, but in practice the solution needs to be able to cope with branching, where the community surviving at one time step splits into two separate trajectories, each undergoing its own independent extinction.
As a sketch of the process:
1111111111111111 (Initial 16 species, choose 11 survivors)
0110110101101111 (11 species, choose 9 survivors)
0110110101100011 (9 species, choose 5 survivors)
0100100000100011 (End of simulation)
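To make the branching requirement concrete, here is a minimal sketch of one way it could be handled, assuming a dictionary keyed by hypothetical trajectory labels and an illustrative extinction_step helper (not part of the implementations below; p=0.7 roughly matches the sketch):

import numpy as np

def extinction_step(pool, p=0.7):
    # one extinction: draw the number of survivors from a binomial,
    # then choose which of the current species survive
    n_surv = np.random.binomial(len(pool), p)
    return np.random.choice(pool, size=n_surv, replace=False)

# a single starting community, then a split into two independent trajectories
trajectories = {'root': extinction_step(np.arange(16))}
trajectories['branch_a'] = extinction_step(trajectories['root'])
trajectories['branch_b'] = extinction_step(trajectories['root'])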
This process gets used a lot and the communities can get quite large, so I'm trying to speed it up as much as possible and keep the memory use down. Currently I have three competing implementations:
A) Using a boolean numpy matrix to store which species are alive at each time step. The original motivation was a lower memory profile from storing only the presence/absence of each species, but numpy uses a full byte to store boolean values, so this is eight times less memory efficient than I thought!
import numpy as np

def using_2D_matrix(nspp=1000, nscene=250):
    # define a matrix to hold the communities and
    # set the initial community
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1
    # loop over each extinction scene, looking up the indices
    # of live species and then selecting survivors
    for i in range(0, nscene - 1):
        candidates = np.where(m[i, ])[0]
        n_surv = np.random.binomial(len(candidates), 0.99)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1
    return m
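As a quick check of the byte-per-bool point above, and of what bit-level packing via np.packbits would save (assuming the numpy import above; not part of the benchmarks):

row = np.ones(16, dtype=bool)
print(row.nbytes)               # 16 bytes: one byte per species
print(np.packbits(row).nbytes)  # 2 bytes: one bit per species, but needs unpacking before indexing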
B) Storing a dictionary of 1D arrays holding the unique indices of surviving species removes the need for np.where. It might have higher memory use because it will likely need uint32 to store the ids, but where extinction is high you only have to store a short list of indices rather than a whole row of the boolean array, so it is going to be case specific.
def using_dict_of_arrays(nspp=1000, nscene=250):
    # initialise a dictionary holding an array giving a
    # unique integer to each species
    m = {0: np.arange(nspp)}
    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        n_surv = np.random.binomial(len(m[i]), 0.99)
        surv = np.random.choice(m[i], size=n_surv, replace=False)
        m[i + 1] = surv
    return m
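To put a rough number on the memory trade-off mentioned above: a boolean row costs one byte per species, while survivor indices cost four bytes each if stored as uint32 (np.arange actually defaults to a wider platform integer), so the index representation only wins once fewer than about a quarter of the species remain. A back-of-the-envelope check:

nspp = 1000
bool_row = np.zeros(nspp, dtype=bool)
survivors = np.arange(nspp // 4, dtype=np.uint32)  # 25% of species left
print(bool_row.nbytes, survivors.nbytes)           # 1000 vs 1000 bytes, the break-even point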
Of these, B is faster by about 10-15%.
import timeit
A = timeit.Timer(using_2D_matrix)
A.timeit(100)
# 1.6549
B = timeit.Timer(using_dict_of_arrays)
B.timeit(100)
# 1.3580
C) I then thought about using bitarray to store the presence or absence of species in communities compactly as actual bits. This could also offer efficiencies by using bitops to compare the overlap between communities. So:
from bitarray import bitarray

def using_bitarray(nspp=1000, nscene=250):
    # initialise the starting community
    m = {0: bitarray('1' * nspp)}
    for i in range(0, nscene - 1):
        # pick how many die and which they are (fewer bits to swap)
        n_die = np.random.binomial(m[i].count(True), 0.01)
        unlucky = np.random.choice(m[i].search(bitarray('1')), size=n_die, replace=False)
        # clone the source community and kill some off
        m[i + 1] = bitarray(m[i])
        for s in unlucky:
            m[i + 1][s] = False
    return m
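As an aside on the overlap idea above: bitarray supports the usual bitwise operators, so comparing two communities is a one-liner (a small illustration, assuming m comes from the function above):

m = using_bitarray()
shared = (m[10] & m[50]).count()  # number of species present in both scenes
print(shared)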
All of which is nice, but quite a lot slower.
C = timeit.Timer(using_bitarray)
C.timeit(100)
# 2.54035
Am I missing an approach that would work faster?
Upvotes: 4
Views: 349
Reputation: 221644
The loopy version, with the probability as a parameter and handling for the case when candidates might be an empty array (we need to break out of the loop), would look something like this -
def using_2D_matrix(nspp=1000, nscene=250, prob=0.99):
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1
    for i in range(0, nscene - 1):
        candidates = np.where(m[i, ])[0]
        if len(candidates) == 0:
            break
        n_surv = np.random.binomial(len(candidates), prob)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1
    return m
Now, upon a closer look, we see that the code basically selects random unique elements per row, and for each subsequent row it keeps selecting unique ones, but only from those already chosen in the previous row. The number of unique elements to be selected is based upon the probability parameter prob. Thus, with a high probability like 0.99, it keeps a fraction 0.99 of the species for the second row, since for the first row we have already chosen all of them with m[0, ] = 1. Then for the third row it keeps 0.99 of the second row's choices, which comes to 0.99*0.99 = 0.9801 of the original, and so on. So the pattern is that we are keeping a fraction 0.99^[0, 1, 2, 3, ...] of the species per row as we go from the first row onwards.
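A quick numeric check of that pattern, using the same nspp and prob as the sample run further down:

nspp, prob = 1000, 0.99
print(np.rint(nspp * prob ** np.arange(8)).astype(int))
# [1000  990  980  970  961  951  941  932]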
The idea we can leverage here is this: if we could generate a 2D array with all possible column indices per row, randomly scattered, and then select the first 100% of the elements for the first row, the first 99% for the second row, the first 98.01% for the third row and so on, those would be the column indices to set in the output mask array.
That's the whole idea behind the vectorized solution!
Implementation -
def vectorized_app(nspp=1000, nscene=250, prob=0.99):
    r = np.arange(nscene)
    lims = np.rint(nspp * (prob ** r)).astype(int)
    rands = np.random.rand(nscene, nspp).argpartition(0, axis=1)
    mask = lims[:, None] > np.arange(nspp)
    row_idx = np.repeat(r, lims)
    col_idx = rands[mask]
    out = np.zeros((nscene, nspp), dtype='bool')
    out[row_idx, col_idx] = 1
    return out
Sample run -
In [159]: out = vectorized_app(nspp=1000, nscene=250, prob=0.99)
In [160]: s = out.sum(1)
In [161]: s
Out[161]:
array([1000, 990, 980, 970, 961, 951, 941, 932, 923, 914, 904,
895, 886, 878, 869, 860, 851, 843, 835, 826, 818, 810,
...........................................
88, 87, 86, 85, 84, 84, 83, 82])
Let's test out the performance -
In [119]: %timeit using_2D_matrix(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 8 ms per loop
In [120]: %timeit vectorized_app(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 3.76 ms per loop
In [121]: 8/3.76
Out[121]: 2.127659574468085
Now, the bottleneck of the proposed approach is generating the random numbers, specifically the number of random numbers needed. So, the vectorized approach would be at a disadvantage if you are working with a larger nspp and a relatively smaller nscene, which is the dimension the loopy version iterates along -
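To put rough numbers on that for the large case timed below: the vectorized version always draws (and partitions) nscene * nspp random values, while the loopy version only has to work with however many candidates survive into each scene. A back-of-the-envelope estimate, treating the per-scene survivor counts as a proxy for the loopy version's work:

nspp, nscene, prob = 10000, 2500, 0.99
vectorized_draws = nscene * nspp
loopy_candidates = int(np.rint(nspp * prob ** np.arange(nscene)).sum())
print(vectorized_draws, loopy_candidates)  # 25 million vs roughly 1 million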
In [143]: %timeit using_2D_matrix(nspp=10000, nscene=2500, prob=0.99)
10 loops, best of 3: 53.8 ms per loop
In [144]: %timeit vectorized_app(nspp=10000, nscene=2500, prob=0.99)
1 loops, best of 3: 309 ms per loop
With nscene being a bigger number, the results would be in favour of the vectorized one -
In [145]: %timeit using_2D_matrix(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 10.6 ms per loop
In [146]: %timeit vectorized_app(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 3.24 ms per loop
In [147]: %timeit using_2D_matrix(nspp=10, nscene=2500, prob=0.99)
100 loops, best of 3: 5.72 ms per loop
In [148]: %timeit vectorized_app(nspp=10, nscene=2500, prob=0.99)
1000 loops, best of 3: 589 µs per loop
Lesson(s) learnt
The trick I learnt while trying to come up with the proposed solution is that we can create unique random numbers per row by generating a random array and then using np.argpartition. Here's a sample of it giving unique elements per row -
In [149]: np.random.rand(3, 4).argpartition(0,axis=1)
Out[149]:
array([[3, 1, 2, 0],
       [0, 1, 2, 3],
       [1, 0, 2, 3]])
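A quick way to convince yourself that each row really is a permutation of the column indices (a small verification sketch):

rands = np.random.rand(3, 4).argpartition(0, axis=1)
print((np.sort(rands, axis=1) == np.arange(4)).all())  # True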
Upvotes: 1
Reputation: 55489
You can speed things up by not locating and counting the survivors at each step.
Let p be the probability that a species survives a given step. Rather than locating each current survivor and killing it off with probability 1 - p, we simply apply the survival test with probability p to all species, whether they are currently survivors or not. Here's a short proof of concept.
import numpy as np

np.random.seed(42)

def test(nspp, nscene):
    m = np.zeros((nscene, nspp), dtype=np.uint8)
    m[0, ] = 1
    for i in range(1, nscene):
        m[i] = m[i - 1] & (np.random.ranf(nspp) < 0.9)
    return m
m = test(10000, 10)
print(np.sum(m, axis=1))
output
[10000 9039 8112 7298 6558 5912 5339 4829 4388 3939]
Of course, this approach means you can't specify an exact number of survivors at each step, but hopefully that's not necessary for your simulation.
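Even so, the counts stay close to the exact-number approach in expectation, since each step keeps a fraction p of the previous survivors on average. A rough comparison against the sample output above (an illustrative check, not part of the original proof of concept):

expected = np.rint(10000 * 0.9 ** np.arange(10)).astype(int)
print(expected)
# [10000  9000  8100  7290  6561  5905  5314  4783  4305  3874]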
Upvotes: 1
Reputation: 114921
Here's an alternative that's pretty fast:
import numpy as np

def using_shuffled_array(nspp=1000, nscene=250):
    a = np.arange(nspp)
    np.random.shuffle(a)
    m = np.zeros(nscene, dtype=int)
    m[0] = nspp
    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        m[i + 1] = np.random.binomial(m[i], 0.99)
    return a, m
Instead of generating a separate array for each generation, it shuffles the initial sequence of species numbers once, and then for each generation it determines how many survive. After the call a, m = using_shuffled_array(), a[:m[k]] gives the survivors at generation k.
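For example, to pull out the community at a particular scene (a small usage sketch):

a, m = using_shuffled_array()
k = 10
survivors_at_k = a[:m[k]]  # ids of the species alive at scene k
print(m[k], len(survivors_at_k))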
Here's a timing comparison:
In [487]: %timeit using_dict_of_arrays()
100 loops, best of 3: 7.93 ms per loop
In [488]: %timeit using_shuffled_array()
1000 loops, best of 3: 607 µs per loop
Upvotes: 1