foobar

Reputation: 507

Fast Python/Numpy Frequency-Severity Distribution Simulation

I'm looking for a way to simulate a classical frequency-severity distribution, something like: X = sum(Y_i, i = 1..N), where N is for example Poisson distributed and the Y_i are lognormal.

A simple, naive numpy script would be:

import numpy as np
SIM_NUM = 3

X = []
for _i in range(SIM_NUM):
    nr_claims = np.random.poisson(1)
    temp = []
    for _j in range(nr_claims):
        temp.append(np.random.lognormal(0, 1))
    X.append(sum(temp))

Now I'm trying to vectorize that for a performance increase. A bit better is the following:

N = np.random.poisson(1, SIM_NUM)
X = []
for n in N:
    X.append(sum(np.random.lognormal(0, 1, n)))

My question is whether I can also vectorize the second loop, for example by simulating all the losses with:

N = np.random.poisson(1, SIM_NUM)
# print(N) would lead to something like [1 3 0]
losses = np.random.lognormal(0, 1, sum(N))
# print(losses) would lead to something like
# [ 0.56750244  0.84161871  0.41567216  1.04311742]

# X should now be [ 0.56750244, 0.84161871 + 0.41567216 + 1.04311742, 0] 

Ideas that I have are "smart slicing" or "matrix multiplication" with A = [[1, 0, 0, 0], [0, 1, 1, 1], [0, 0, 0, 0]]. But I couldn't make anything clever out of these ideas.
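For reference, here is only a sketch of that matrix-multiplication idea (the names ids and A are just illustrative); it builds a dense SIM_NUM x N.sum() indicator matrix, so it is memory-hungry and probably not the fastest option:

import numpy as np

SIM_NUM = 3
N = np.random.poisson(1, SIM_NUM)            # e.g. [1 3 0]
losses = np.random.lognormal(0, 1, N.sum())  # one lognormal draw per claim

# Row i of A has ones exactly at the positions of the losses
# that belong to simulation i.
ids = np.repeat(np.arange(SIM_NUM), N)                  # e.g. [0 1 1 1]
A = (ids == np.arange(SIM_NUM)[:, None]).astype(float)  # shape (SIM_NUM, N.sum())

X = A @ losses  # X[i] = total loss of simulation i (0 if it has no claims)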

I'm looking for the fastest possible computation of X.

Upvotes: 3

Views: 786

Answers (2)

user6758673

Reputation: 589

You're looking for numpy.add.reduceat:

N = np.random.poisson(1, SIM_NUM)
losses = np.random.lognormal(0, 1, np.sum(N))

x = np.zeros(SIM_NUM)
# Start offsets of the non-empty groups within losses
offsets = np.r_[0, np.cumsum(N[N > 0])]
# Sum each group; simulations with zero claims keep their initial 0
x[N > 0] = np.add.reduceat(losses, offsets[:-1])

The case where n == 0 is handled separately, because of how reduceat works. Also, be sure to use numpy.sum on arrays instead of the much slower Python sum.
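To see why the zero-claim case needs special handling: when an offset is repeated (an empty group), reduceat returns the element at that offset rather than an empty sum of 0. A small illustration:

a = np.array([1., 2., 3.])
np.add.reduceat(a, [0, 1, 1])
# -> array([1., 2., 5.]), not array([1., 0., 5.]):
# for the "empty" middle group, reduceat returns a[1] instead of 0,
# which is why the zero-claim simulations are masked out with N > 0.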

Whether this is faster than the other answer depends on the mean of your Poisson distribution.

Upvotes: 1

Divakar

Reputation: 221624

We can use np.bincount, which is quite efficient for such interval/ID-based summing operations, especially when working with 1D arrays. The implementation would look something like this -

# Generate all Poisson distributed values in one go
pv = np.random.poisson(1, SIM_NUM)

# Use the Poisson values to get the total count of lognormal samples needed
# and generate all of those random numbers in one vectorized call
rand_arr = np.random.lognormal(0, 1, pv.sum())

# Finally, create IDs using pv as run lengths and feed them to bincount,
# doing ID-based and thus effectively interval-based summing
out = np.bincount(np.arange(pv.size).repeat(pv), rand_arr, minlength=SIM_NUM)
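Just to make the bincount call concrete, on the question's example pv = [1 3 0] the ID construction gives:

pv = np.array([1, 3, 0])
ids = np.arange(pv.size).repeat(pv)
# ids -> array([0, 1, 1, 1]): loss 0 belongs to simulation 0,
# losses 1-3 belong to simulation 1, and simulation 2 has no claims.
# bincount then sums rand_arr per ID; minlength=SIM_NUM keeps the
# trailing zero for claim-free simulations.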

Runtime test -

Function definitions:

def original_app1(SIM_NUM):
    X = []
    for _i in range(SIM_NUM):
        nr_claims = np.random.poisson(1)
        temp = []
        for _j in range(nr_claims):
            temp.append(np.random.lognormal(0, 1))
        X.append(sum(temp))
    return X

def original_app2(SIM_NUM):
    N = np.random.poisson(1, SIM_NUM)
    X = []
    for n in N:
        X.append(sum(np.random.lognormal(0, 1, n)))
    return X

def vectorized_app1(SIM_NUM):
    pv = np.random.poisson(1, SIM_NUM)
    r = np.random.lognormal(0, 1, pv.sum())
    return np.bincount(np.arange(pv.size).repeat(pv), r, minlength=SIM_NUM)

Timings on large datasets:

In [199]: SIM_NUM = 1000

In [200]: %timeit original_app1(SIM_NUM)
100 loops, best of 3: 2.6 ms per loop

In [201]: %timeit original_app2(SIM_NUM)
100 loops, best of 3: 6.65 ms per loop

In [202]: %timeit vectorized_app1(SIM_NUM)
1000 loops, best of 3: 252 µs per loop

In [203]: SIM_NUM = 10000

In [204]: %timeit original_app1(SIM_NUM)
10 loops, best of 3: 26.1 ms per loop

In [205]: %timeit original_app2(SIM_NUM)
10 loops, best of 3: 77.5 ms per loop

In [206]: %timeit vectorized_app1(SIM_NUM)
100 loops, best of 3: 2.46 ms per loop

So, we are looking at some 10x+ speedup there.

Upvotes: 3
