blueonken2

Reputation: 11

Speeding up simulation code & ideas to make it more efficient

I created a quick Monte Carlo simulation which seems to do what I want (simple version below).

import numpy as np
import scipy.stats as sp

sigma = 0.5  # lognormal shape parameter
u = 10       # lognormal log-scale (scale = exp(u))
mu = 100     # Poisson mean, i.e. expected number of claims


no_sim = int(10e3)
no_col = 2
mat_res = np.zeros((no_sim,no_col)) #results matrix holder

lim = 40e3 # parameter 1 - needs to vary
xs = 2000 # parameter 2 - needs to vary


for i in range(no_sim):
    no_clm = sp.poisson.rvs(mu=mu,size=1)
    clm_sev = sp.lognorm.rvs(s=sigma,scale=np.exp(u),size=no_clm)
    
    temp_mat = np.zeros((np.size(clm_sev),2)) # temp matrix for calculations
    
    temp_mat[:,0] = clm_sev
    temp_mat[:,1] = np.minimum(lim,np.maximum(0,temp_mat[:,0]-xs)) #want to expand this step for various values of lim and xs, e.g. 5 different options
    
    
    mat_res[i,0] = np.sum(temp_mat[:,0])
    mat_res[i,1] = np.sum(temp_mat[:,1])
    
    

#want these mean values for various different lim and xs values which are predefined
print("mean 1 is %s" % np.mean(mat_res[:,0]),
      "mean 2 is %s" % np.mean(mat_res[:,1]))

I am trying to do two things:

  1. Speed up the code. In reality I will need to run over a million simulations, so I want to do this in the best way possible; currently a million simulations take five minutes or so.

  2. Expand the code efficiently so that I can vary the parameters lim and xs and get a new result. Right now I have lim = 40e3 and xs = 2000,

but I would also want to run it with, say, lim = 50e3 and xs = 1000 and then print that mean alongside the original one. One solution is to wrap the for-loop in a function that takes these parameters, but since lim and xs are only used on a single line of the simulation, re-running the whole simulation from scratch seems wasteful, and I can't think of a good way to build this into the for-loop as it stands.
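For concreteness, here is a minimal sketch of the kind of expansion I have in mind, assuming the (lim, xs) pairs are known up front (the pairs below are the two mentioned above):

import numpy as np
import scipy.stats as sp

sigma, u, mu = 0.5, 10, 100
no_sim = int(10e3)
layers = np.array([[40e3, 2000], [50e3, 1000]])  # (lim, xs) pairs to evaluate
mat_res = np.zeros((no_sim, 1 + len(layers)))

for i in range(no_sim):
    no_clm = sp.poisson.rvs(mu=mu)
    clm_sev = sp.lognorm.rvs(s=sigma, scale=np.exp(u), size=no_clm)
    mat_res[i, 0] = clm_sev.sum()
    # broadcast each claim against every (lim, xs) pair: shape (no_clm, n_layers)
    layered = np.clip(clm_sev[:, None] - layers[:, 1], 0, layers[:, 0])
    mat_res[i, 1:] = layered.sum(axis=0)

print(mat_res.mean(axis=0))  # gross mean plus one mean per (lim, xs) pair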

Upvotes: 0

Views: 1754

Answers (3)

Nick ODell

Reputation: 25180

I found that you can get a 20x speedup by using vectorization and doing more of the operations in numpy.

I did need to make some slight changes to the simulation itself, though. Specifically, instead of sampling no_clm every time through the loop, I sample it once per "batch," and use the same value for the entire batch.

Once I make this simplification, it's possible to create a matrix of lognormal-sampled values and apply the clipping and summation to each row of that matrix. This means more of the calculation happens in numpy.

Here's the code:

def sim(xs, lim, no_sim):
    mat_res = np.zeros((no_sim, no_col))  # results matrix holder
    batch_size = 100
    for i in range(0, no_sim, batch_size):
        if i + batch_size >= no_sim:
            batch_size = no_sim - i  # trim the final batch
        # one claim count shared by every simulation in the batch
        no_clm = sp.poisson.rvs(mu=mu, size=1)[0]
        clm_sev = sp.lognorm.rvs(s=sigma, scale=np.exp(u), size=(batch_size, no_clm))

        mat_res[i:i+batch_size, 0] = np.sum(clm_sev, axis=1)
        mat_res[i:i+batch_size, 1] = np.sum(np.clip(clm_sev - xs, 0, lim), axis=1)
    return mat_res
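For example, to reproduce the original means (an illustrative call, assuming the question's constants are still in scope):

mat = sim(xs=2000, lim=40e3, no_sim=int(10e3))
print("mean 1 is %s" % np.mean(mat[:, 0]),
      "mean 2 is %s" % np.mean(mat[:, 1]))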

Benchmark results:

Method              Timing
------------------  ---------------------
No vectorization    1.380 s per 10k loops
Batch size 10       0.176 s per 10k loops
Batch size 100      0.051 s per 10k loops
Batch size 1000     0.036 s per 10k loops

Increasing the batch size increases speed, at the cost of re-using the same no_clm value across more simulations. I think the sweet spot is around 100.

Because this method only uses one core, this speedup will stack with the parallelization suggestion that @pinxau1000 made.
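For instance, here is a minimal sketch of how the two could combine, assuming sim and the question's constants are defined at module top level (the parameter pairs are illustrative):

import multiprocessing

if __name__ == "__main__":
    params = [(2000, 40e3, int(10e3)), (1000, 50e3, int(10e3))]  # (xs, lim, no_sim)
    with multiprocessing.Pool() as pool:
        results = pool.starmap(sim, params)  # one vectorized run per worker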

Upvotes: 0

pinxau1000

Reputation: 301

You can use the multiprocessing library to parallelize your code. If you can express the processing entirely with numpy arrays and operations, you can also use numpy's vectorize. As Glauco said, you can also optimize the code itself.
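As a minimal illustration of numpy's vectorize (note it is a convenience wrapper around a Python loop, not a compiled speedup; the scalar helper layer_loss below is hypothetical):

import numpy as np

def layer_loss(clm, xs=2000, lim=40e3):
    # clipping for a single scalar claim, as in the question
    return min(lim, max(0.0, clm - xs))

layered = np.vectorize(layer_loss)(np.array([1500.0, 5000.0, 80000.0]))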

I made a snippet that uses multiprocessing. It is not optimized at all, just a draft for your reference; on my computer the execution time improved roughly 2.9x.

import time
import multiprocessing
from functools import partial

import tqdm
import numpy as np
import scipy.stats as sp


def simulation_0(mu: int, sigma: int, scale: float = np.exp(1)):
    no_clm = sp.poisson.rvs(mu=mu, size=1)

    # Your scale was np.exp(u); I made it a parameter for example purposes.
    return sp.lognorm.rvs(s=sigma, scale=scale, size=no_clm)


def simulation_1(xs: int, lim: int, mu: int, sigma: int, scale: float = np.exp(1)):
    # want to expand this step for various values of lim and xs, e.g. 5 different options
    return np.minimum(lim, np.maximum(0, simulation_0(mu=mu, sigma=sigma, scale=scale) - xs))


def simulation(xs: int, lim: int, mu: int, sigma: int, scale: float = np.exp(1)):
    with open("inputs.csv", "a") as fwriter:
        fwriter.write(f"{xs}, {lim}\n")

    sim_0_res = simulation_0(mu=mu, sigma=sigma, scale=scale)
    with open("simulation_0.csv", "a") as fwriter:
        fwriter.write(f"{sim_0_res},")

    sim_1_res = simulation_1(xs=xs, lim=lim, mu=mu, sigma=sigma, scale=scale)
    with open("simulation_1.csv", "a") as fwriter:
        fwriter.write(f"{sim_1_res},")


def simulation_wrapper(xs_lim_tuple: tuple, mu: int, sigma: int, scale: float = np.exp(1)):
    return simulation(xs=xs_lim_tuple[0], lim=xs_lim_tuple[1], mu=mu, sigma=sigma, scale=scale)


def main():
    # Constants
    mu = 0.5
    sigma = 0.5

    # The parameters to sweep
    lim = np.linspace(30e3, 50e3, 100)  # parameter 1 - needs to vary
    xs = np.linspace(1000, 3000, 100)  # parameter 2 - needs to vary
    # Creates a matrix with all the combinations
    sweep_parameters = np.array(np.meshgrid(xs, lim)).T.reshape((np.size(lim) * np.size(xs), 2))

    # Create a pool with N threads (N is set to be the number of threads)
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())

    # This allows us to "initialize" the function call, since the mu and sigma will not change throughout the
    # execution we can already set these parameters.
    partial_function = partial(simulation_wrapper, mu=mu, sigma=sigma)

    # TQDM is only used to display a progressbar
    for _ in tqdm.tqdm(pool.imap(partial_function, sweep_parameters), total=np.shape(sweep_parameters)[0]):
        pass


def main_no_parallel():
    # Constants
    mu = 0.5
    sigma = 0.5

    # The parameters to sweep
    lim = np.linspace(30e3, 50e3, 100)  # parameter 1 - needs to vary
    xs = np.linspace(1000, 3000, 100)  # parameter 2 - needs to vary
    # Creates a matrix with all the combinations
    sweep_parameters = np.array(np.meshgrid(xs, lim)).T.reshape((np.size(lim) * np.size(xs), 2))

    # This allows us to "initialize" the function call, since the mu and sigma will not change throughout the execution we
    # can already set these parameters.
    partial_function = partial(simulation_wrapper, mu=mu, sigma=sigma)

    # TQDM is only used to display a progressbar
    for sweep_parameter in tqdm.tqdm(sweep_parameters, total=np.shape(sweep_parameters)[0]):
        partial_function(sweep_parameter)


if __name__ == "__main__":
    exec_times = []
    for i in range(5):
        start_time = time.time()
        main()
        exec_times.append(time.time() - start_time)
        print(f"--- {exec_times[-1]} seconds ---")
    print(f"\nParallelization Avg Exec Time: {np.average(exec_times)}\n")

    exec_times = []
    for i in range(5):
        start_time = time.time()
        main_no_parallel()
        exec_times.append(time.time() - start_time)
        print(f"--- {exec_times[-1]} seconds ---")
    print(f"\nNo-Parallelization Avg Exec Time: {np.average(exec_times)}\n")

Output:

100%|██████████| 10000/10000 [00:01<00:00, 6492.83it/s]
--- 1.5754084587097168 seconds ---
100%|██████████| 10000/10000 [00:01<00:00, 6395.57it/s]
--- 1.585716724395752 seconds ---
100%|██████████| 10000/10000 [00:01<00:00, 6657.03it/s]
--- 1.5238027572631836 seconds ---
100%|██████████| 10000/10000 [00:01<00:00, 6705.56it/s]
--- 1.5113093852996826 seconds ---
100%|██████████| 10000/10000 [00:01<00:00, 6691.30it/s]
--- 1.5149972438812256 seconds ---

Parallelization Avg Exec Time: 1.5422469139099122

100%|██████████| 10000/10000 [00:04<00:00, 2216.83it/s]
--- 4.511844873428345 seconds ---
100%|██████████| 10000/10000 [00:04<00:00, 2202.73it/s]
--- 4.540390491485596 seconds ---
100%|██████████| 10000/10000 [00:04<00:00, 2247.02it/s]
--- 4.450916051864624 seconds ---
100%|██████████| 10000/10000 [00:04<00:00, 2237.04it/s]
--- 4.470769166946411 seconds ---
100%|██████████| 10000/10000 [00:04<00:00, 2215.68it/s]
--- 4.513864517211914 seconds ---

No-Parallelization Avg Exec Time: 4.497557020187378

Upvotes: 1

Glauco

Reputation: 1465

Fortunately, Monte Carlo is a trivially parallelizable problem, but I suggest you proceed in two steps:

  1. Optimization. Benchmark to find which statements are the slowest and refactor them efficiently: there are profilers for this kind of work (I also wrote a perf tool for a similar problem).

  2. Parallelize the computation. Using a library like ray, joblib, or celery you will get big performance benefits. Split the computation into a bunch of tasks: for example, on a single machine with 8 CPUs you can reduce the Monte Carlo run to a sum of 8 small Monte Carlo runs, as sketched below.
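A minimal sketch of point 2 with joblib, assuming the question's loop is wrapped in a hypothetical run_chunk helper:

import numpy as np
import scipy.stats as sp
from joblib import Parallel, delayed

def run_chunk(no_sim, lim=40e3, xs=2000, mu=100, sigma=0.5, u=10):
    # one small Monte Carlo: clipped-loss total per simulation
    out = np.empty(no_sim)
    for i in range(no_sim):
        no_clm = sp.poisson.rvs(mu=mu)
        clm_sev = sp.lognorm.rvs(s=sigma, scale=np.exp(u), size=no_clm)
        out[i] = np.clip(clm_sev - xs, 0, lim).sum()
    return out

# 8 CPUs -> 8 small Monte Carlos of one eighth the size each
parts = Parallel(n_jobs=8)(delayed(run_chunk)(125_000) for _ in range(8))
print(np.concatenate(parts).mean())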

Upvotes: 0
