David_O

Reputation: 1153

Fast indexing and bit flipping of boolean data using Python

Using Python, I'm running a simulation where a community of species goes through a sequential set of time steps ('scenes') in each of which extinction occurs. From an initial set of N species, each extinction needs to select a number of survivors, which then forms the pool to be subsampled at the next extinction. The number of survivors at each step is drawn randomly from a binomial distribution, given the community size and a per-species probability of survival.

The examples below show a single chain of steps, but in practice the solution needs to be able to cope with branching, where the community surviving at one time step splits into two separate trajectories, each undergoing its own independent extinction.

As a sketch of the process:

1111111111111111  (Initial 16 species, choose 11 survivors)
0110110101101111  (11 species, choose 9 survivors)
0110110101100011  (9 species, choose 5 survivors)
0100100000100011  (End of simulation)
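
To illustrate the branching mentioned above, here is a minimal sketch of a branch point using the index-array representation from option B below (the helper name split_community is purely illustrative):

import numpy as np

def split_community(parents, p_surv=0.99):
    # from one surviving community (an array of species ids), draw two
    # independent sets of survivors, one per branch
    branches = []
    for _ in range(2):
        n_surv = np.random.binomial(len(parents), p_surv)
        branches.append(np.random.choice(parents, size=n_surv, replace=False))
    return branches

# e.g. branch a 16-species community into two trajectories, each of which
# then continues through its own independent sequence of extinctions
left, right = split_community(np.arange(16))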

This process gets used a lot and the communities can get quite large, so I'm trying to speed it up as much as possible and keep the memory use down. Currently I have three competing implementations:

A) Using a boolean numpy matrix to store which species are alive at each time step. The original motivation for this was that it would have a lower memory profile by just storing the presence/absence of the species, but numpy uses a full byte to store boolean values so this is eight times less memory efficient than I thought!

import numpy as np

def using_2D_matrix(nspp=1000, nscene=250):

    # define a matrix to hold the communities and 
    # set the initial community 
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1

    # loop over each extinction scene, looking up the indices
    # of live species and then selecting survivors
    for i in range(0, nscene - 1):
        candidates = np.where(m[i,])[0]
        n_surv = np.random.binomial(len(candidates), 0.99)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1

    return m
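
As a quick check of the memory point in (A): numpy booleans really do take one byte each, and np.packbits can pack a boolean array down to one bit per element (the axis argument needs a reasonably recent numpy), at the cost of having to unpack before fancy indexing:

m = np.zeros((250, 1000), dtype=bool)
print(m.itemsize)                # 1 byte per element
print(m.nbytes)                  # 250000 bytes for the whole matrix

packed = np.packbits(m, axis=1)  # one bit per species, shape (250, 125)
print(packed.nbytes)             # 31250 bytes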

B) Storing a dictionary of 1D arrays holding the unique indices of the surviving species removes the need for np.where. It might have higher memory use because it will likely need uint32 values to store the ids, but where extinction is high you only have to store a short list of indices rather than a whole row of the boolean array, so which representation is smaller is case specific. For example, assuming uint32 ids and nspp = 1000, a boolean row costs 1000 bytes while k surviving indices cost 4k bytes, so the index array wins once fewer than about 250 species survive.

def using_dict_of_arrays(nspp=1000, nscene=250):

    # initialise a dictionary holding an array giving a 
    # unique integer to each species
    m = {0: np.arange(nspp)}

    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        n_surv = np.random.binomial(len(m[i]), 0.99)
        surv = np.random.choice(m[i], size=n_surv, replace=False)
        m[i + 1] = surv

    return m

Of these, B is faster by about 10-15%.

import timeit
A = timeit.Timer(using_2D_matrix)
A.timeit(100)
# 1.6549
B = timeit.Timer(using_dict_of_arrays)
B.timeit(100)
# 1.3580

C) I then thought about doing this using bitarray to store the presence or absence of species in communities compactly as actual bits. This could also offer efficiencies by using bitops to compare overlap in communities. So:

from bitarray import bitarray

def using_bitarray(nspp=1000, nscene=250):
    # initialise the starting community
    m = {0: bitarray('1' * nspp)}

    # loop over the extinction scenes (nscene - 1 steps, as in A and B)
    for i in range(0, nscene - 1):
        # pick how many die and which they are (fewer bits to swap)
        n_die = np.random.binomial(m[i].count(True), 0.01)
        unlucky = np.random.choice(m[i].search(bitarray('1')), size=n_die, replace=False)
        # clone the source community and kill some off
        m[i + 1] = bitarray(m[i])
        for s in unlucky:
            m[i + 1][s] = False

    return m
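
As a small sketch of the overlap idea mentioned in (C), two communities can be compared with a single bitwise AND (using the example communities from the process sketch above):

a = bitarray('0110110101101111')   # community after the first extinction
b = bitarray('0110110101100011')   # community after the second extinction
shared = a & b                     # species present in both communities
print(shared.count(True))          # 9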

All of which is nice, but quite a lot slower.

C = timeit.Timer(using_bitarray)
C.timeit(100)
# 2.54035

Am I missing an approach that would work faster?

Upvotes: 4

Views: 349

Answers (3)

Divakar

Reputation: 221644

Prospective approach

The loopy version, with the probability as a parameter and handling the case where candidates might be an empty array (in which case we need to break out), would look something like this -

def using_2D_matrix(nspp=1000, nscene=250, prob=0.99):
    m = np.zeros((nscene, nspp), dtype='bool_')
    m[0, ] = 1
    for i in range(0, nscene - 1):
        candidates = np.where(m[i,])[0]
        if len(candidates)==0:
            break    
        n_surv = np.random.binomial(len(candidates), prob)
        surv = np.random.choice(candidates, size=n_surv, replace=False)
        m[i + 1, surv] = 1
    return m

Now, upon a closer look, we see that the code basically selects random unique elements per row and, for the subsequent rows, keeps selecting unique elements only from those already chosen for the previous row. The number of unique elements to be selected is based upon the probability parameter prob. Thus, with a high probability like 0.99, it will select (on average) a fraction 0.99 of the species for the second row, since for the first row we have already chosen all of them with m[0, ] = 1. Then for the third row it will keep 0.99 of the ones chosen for the second row, which comes to 0.99*0.99 = 0.9801, and so on. So the pattern is that we are choosing a fraction 0.99^([0,1,2,3...]) of the elements per row as we go from the first row onwards.
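
For instance, those per-row counts can be computed directly, and they match the row sums in the sample run further down:

np.rint(1000 * 0.99 ** np.arange(5)).astype(int)
# array([1000,  990,  980,  970,  961])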

The idea we can leverage here is this: if we generate a 2D array with all possible indices per row, randomly scattered, and then select all of the elements for the first row, the first fraction 0.99 of them for the second row, the first 0.9801 for the third row and so on, those would be the column indices to set in the output mask array.

That's the whole idea there to give us a vectorized solution!

Implementation -

def vectorized_app(nspp=1000, nscene=250, prob=0.99):
    r = np.arange(nscene)
    lims = np.rint(nspp*(prob**(r))).astype(int)

    rands = np.random.rand(nscene, nspp).argpartition(0,axis=1)

    mask = lims[:,None] > np.arange(nspp)
    row_idx = np.repeat(r,lims)
    col_idx = rands[mask]

    out = np.zeros((nscene, nspp), dtype='bool')
    out[row_idx, col_idx] = 1
    return out

Sample run -

In [159]: out = vectorized_app(nspp=1000, nscene=250, prob=0.99)

In [160]: s = out.sum(1)

In [161]: s
Out[161]: 
array([1000,  990,  980,  970,  961,  951,  941,  932,  923,  914,  904,
        895,  886,  878,  869,  860,  851,  843,  835,  826,  818,  810,
        ...........................................
         88,   87,   86,   85,   84,   84,   83,   82])

Timings and related discussion

Let's test out the performance -

In [119]: %timeit using_2D_matrix(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 8 ms per loop

In [120]: %timeit vectorized_app(nspp=1000, nscene=250, prob=0.99)
100 loops, best of 3: 3.76 ms per loop

In [121]: 8/3.76
Out[121]: 2.127659574468085

Now, the bottleneck of the proposed approach is generating the random numbers, specifically the number of random numbers needed. So the vectorized approach would be at a disadvantage when working with a larger nspp and a relatively smaller nscene, the dimension along which the loopy version iterates -

In [143]: %timeit using_2D_matrix(nspp=10000, nscene=2500, prob=0.99)
10 loops, best of 3: 53.8 ms per loop

In [144]: %timeit vectorized_app(nspp=10000, nscene=2500, prob=0.99)
1 loops, best of 3: 309 ms per loop

With nscene being a bigger number, the results would be in favour of the vectorized one -

In [145]: %timeit using_2D_matrix(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 10.6 ms per loop

In [146]: %timeit vectorized_app(nspp=100, nscene=2500, prob=0.99)
100 loops, best of 3: 3.24 ms per loop

In [147]: %timeit using_2D_matrix(nspp=10, nscene=2500, prob=0.99)
100 loops, best of 3: 5.72 ms per loop

In [148]: %timeit vectorized_app(nspp=10, nscene=2500, prob=0.99)
1000 loops, best of 3: 589 µs per loop

Lesson(s) learnt

While trying to come up with the proposed solution, the trick I learnt in the process is that we can create unique random numbers per row by generating a matrix of random numbers and then using np.argpartition. Here's a sample of it, giving unique elements per row -

In [149]: np.random.rand(3, 4).argpartition(0,axis=1)
Out[149]: 
array([[3, 1, 2, 0],
       [0, 1, 2, 3],
       [1, 0, 2, 3]])
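
For comparison, argsort on the same random matrix also gives one of each column index per row, as a full random permutation, though sorting a whole row is O(n log n) against argpartition's O(n):

np.random.rand(3, 4).argsort(axis=1)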

Upvotes: 1

PM 2Ring

Reputation: 55489

You can speed things up by not locating and counting the survivors at each step.

Let p be the probability that a species survives a step. Rather than locating each survivor and then testing it for survival with probability p, we simply apply the survival test to every species, whether it is currently a survivor or not; the bitwise AND with the previous row keeps already-extinct species extinct. Here's a short proof of concept.

import numpy as np

np.random.seed(42)

def test(nspp, nscene):
    m = np.zeros((nscene, nspp), dtype=np.uint8)
    m[0,] = 1
    for i in range(1, nscene):
        m[i] = m[i - 1] & (np.random.ranf(nspp) < 0.9)
    return m

m = test(10000, 10)
print(np.sum(m, axis=1))

output

[10000  9039  8112  7298  6558  5912  5339  4829  4388  3939]

Of course, this approach means you can't specify an exact number of survivors at each step, but hopefully that's not necessary for your simulation.

Upvotes: 1

Warren Weckesser

Reputation: 114921

Here's an alternative that's pretty fast:

def using_shuffled_array(nspp=1000, nscene=250):
    a = np.arange(nspp)
    np.random.shuffle(a)

    m = np.zeros(nscene, dtype=int)
    m[0] = nspp

    # loop over the scenes, selecting survivors
    for i in range(0, nscene - 1):
        m[i + 1] = np.random.binomial(m[i], 0.99)

    return a, m

Instead of generating a separate array for each generation, it shuffles the initial sequence of species numbers once, and then for each generation, it determines how many survive. After the call a, m = using_shuffled_array(), a[:m[k]] gives the survivors at generation k.
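
A minimal usage sketch (with the defaults above), including rebuilding a boolean presence/absence row on demand for a given scene:

a, m = using_shuffled_array()

k = 10
survivors_k = a[:m[k]]              # species ids alive at scene k

row_k = np.zeros(1000, dtype=bool)  # boolean row for scene k, if one is needed
row_k[survivors_k] = True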

Here's a timing comparison:

In [487]: %timeit using_dict_of_arrays()
100 loops, best of 3: 7.93 ms per loop

In [488]: %timeit using_shuffled_array()
1000 loops, best of 3: 607 µs per loop

Upvotes: 1
