Reputation: 1241
I want to generate data to be modeled. Based on the following logic:
Input: 2 numpy array named z
and d
.
z
: 1d array, value 0/1
d
: 1d array, value 0/1
Return:
y
: 1d array. value: norm random numbers.
If z == 0 and d==0, y ~ norm(1,1),
if z == 0 and d == 1, y~ norm(0,1),
if z == 1 and d == 0, y ~ norm(1,1),
if z == 1 and d == 1, y ~ norm(2,1).
I want to do it in a super fast,clear and pythonic way.
It seems basic mathmatics and np.where
is faster. In this case, I only have 3 conditions (you can see clearly from basic mathmatics part). If I have 10 or more condition, type them in if-else
statement is sometimes confusing. I want to perform data simulations, which means I will generate data in million times at different n
.So, what is the best way to do it?
# generate data
n = 2000
z = np.random.binomial(1,0.5,n)
d = np.random.binomial(1,0.5,n)
def myfun(x):
return {(0,1):np.random.normal(0,1),\
(0,0):np.random.normal(1,1),\
(1,0):np.random.normal(1,1),\
(1,1):np.random.normal(2,1)}[x]
%%timeit
y = [myfun(i) for i in zip(z,d)]
Out:
16.2 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
if-else
%%timeit
y = np.random.normal([0 if (i == 0) & (j ==1) else 2 if (i == 1) & (j == 1) else 1 for i,j in zip(z,d)],1)
Out:
1.38 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = (1-z)*d*h0 + (1-d)*h1 + z*d*h2
Out:
140 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = np.where((d== 0),h1,0) + np.where((z ==1) & (d== 1),h2,0) + np.where((z ==0) & (d== 1),h0,0)
Out:
156 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Is there any other new method?
Upvotes: 2
Views: 258
Reputation: 35146
I would think that the fastest option is to generate your random numbers only once by making use of array-valued parameters to normal
. Using the new random API:
import numpy as np
rng = np.random.default_rng()
# generate data
n = 2000
z = rng.binomial(1, 0.5, n)
d = rng.binomial(1, 0.5, n)
def generate_once(z, d):
"""Generate randoms for https://stackoverflow.com/questions/59676147"""
# encode mean; scale is always 1 anyway in the example
means = np.zeros_like(z, dtype=float)
z_inds = z == 0
d_inds = d == 0
means[d_inds] = 1
means[z_inds & ~d_inds] = 2
# generate the data
y = rng.normal(means)
return y
y = generate_once(z, d)
I didn't try to time this against all the others, but I'd expect this to be competitive. Consider it a faster variant of your if-else
. Taking shortcuts in mapping out the means
(and in general, also the scales
) as an array can reduce overhead, and generating every normal number exactly once should shave off runtime.
Upvotes: 1
Reputation: 3930
It seems you have already done the legwork here. The outcome is now based on trade-offs. All of the above solutions meet the criteria to a different extent.
Is this code to be used to teach other people? Maybe it is executed only once or twice a day? Its part of a larger project and needs to be extremely clear for someone else to maintain? If this is the case choose the slower but easier to understand and read options.
Is it executed thousands or millions of times a day? Resources cost money that reduces profit from a product? If so comment it super well and use the faster options.
It seems the basic mathematics
option is the best tradeoff as it is simple, easy to understand yet quick to execute.
My biased review of each method:
dict case-when
: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.simple if-else
: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.basic mathematics
: fast, easy to understand if you have even a minor mathematical background (which should include most programmers). np.where
: fast, gets the job done, , will require multiple readings / tests to fully understand what is actually happening but less prone to gotchas because its array based. For reference here are the pythonic philosophy of writing code:
Using the above as criteria its much easier to evaluate whether your code is pythonic or not.
Upvotes: 1