Reputation: 6869
I have two arrays/lists of floating-point values, x and y, and, depending on their respective standard deviations, I compute something:
    def some_func(x, y):
        x_std = np.std(x)
        y_std = np.std(y)
        if x_std == 0 or y_std == 0:
            res = 0.5
        else:
            res = x_std * y_std
        if x_std == 0 and y_std == 0:
            res = 1.0
        # Do something else with `res` before returning it
        return res
Since this function gets called many, many times, I was wondering whether there's a better way to express this while avoiding the nested if/else statements?
I don't think that this is much better:
    def some_func(x, y):
        x_std = np.std(x)
        y_std = np.std(y)
        if x_std == 0 and y_std == 0:
            res = 1.0
        elif x_std == 0 or y_std == 0:
            res = 0.5
        else:
            res = x_std * y_std
        # Do something else with `res` before returning it
        return res
Upvotes: 0
Views: 290
An array will have zero standard deviation if and only if all the elements are equal (aside from rounding errors). Based on the following experiment, you can test whether this is the case in under a fifth of the time that it takes to actually compute the standard deviation.
    $ python -mtimeit -s 'import numpy as np; a=np.zeros((4000,5000))+1' 'np.all(np.equal(a, a.flat[0]))'
    100 loops, best of 3: 13.1 msec per loop
    $ python -mtimeit -s 'import numpy as np; a=np.zeros((4000,5000))+1' 'np.std(a)'
    10 loops, best of 3: 71.9 msec per loop
So if, a reasonable proportion of the time, one or other of the arrays has zero standard deviation, the all-equal test is worth doing, even though it makes the function slightly slower whenever both standard deviations turn out to be non-zero.
    def some_func(x, y):
        x_equal = np.all(np.equal(x, x.flat[0]))
        y_equal = np.all(np.equal(y, y.flat[0]))
        if x_equal and y_equal:
            res = 1.0
        elif x_equal or y_equal:
            res = 0.5
        else:
            res = np.std(x) * np.std(y)
        # Do something else with `res` before returning it
        return res
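Note, incidentally, that .flat only exists on numpy arrays, so if the inputs may be plain Python lists they would need converting first. A minimal usage sketch (the sample values are made up for illustration):

    import numpy as np

    x = np.asarray([1.0, 1.0, 1.0])  # constant, so zero standard deviation
    y = np.asarray([1.0, 2.0, 3.0])  # non-zero standard deviation

    print(some_func(x, y))  # -> 0.5, using some_func as defined above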
You can work out the break-even point. Let p be the probability that both arrays have non-zero standard deviation, t_eq the time to run the "all equal" test on both arrays, and t_std the time to compute both standard deviations. The break-even point is where

t_eq + p * t_std = t_std

i.e. p = 1 - (t_eq / t_std),

which, with the experimentally determined numbers above, is about 0.82.
Therefore if both arrays have non-zero standard deviation less than 82% of the time, i.e. more than about 18% of the time one or other (or both) of the arrays has zero standard deviation, you should see some speed-up from this.
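For concreteness, here is that figure reproduced from the timings measured above (a minimal sketch; the per-array times are doubled because both arrays are tested):

    # Measured per-array timings from the experiment above, in milliseconds
    t_eq = 2 * 13.1    # "all equal" test on both arrays
    t_std = 2 * 71.9   # standard deviation of both arrays

    p_break_even = 1 - t_eq / t_std
    print(round(p_break_even, 2))  # -> 0.82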
You could squeeze the break-even point a little further, at the expense of a messier implementation, by writing your own routine in C (called via a ctypes interface) to determine whether all the elements of an array are equal, returning a negative result as soon as it encounters the first difference, without having to test the rest of the array. I am not aware of a way to do this test with existing numpy routines that takes advantage of short-circuit evaluation (np.all itself should short-circuit, but you still have to compute the whole of the input boolean array first).
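As a middle ground that stays in pure Python/numpy, you could approximate short-circuiting by testing the array in chunks and bailing out at the first chunk that contains a difference. The sketch below shows that chunked idea, not the C routine described above, and the chunk size of 100000 is an arbitrary assumption you would tune:

    import numpy as np

    def all_equal_chunked(a, chunk=100000):
        # Compare each chunk against the first element; stop at the
        # first chunk that contains a differing value.
        flat = np.ravel(a)
        if flat.size == 0:
            return True
        first = flat[0]
        for start in range(0, flat.size, chunk):
            if not np.all(flat[start:start + chunk] == first):
                return False
        return True

Whether this beats a single np.all call depends on how early differences tend to appear; each chunk still pays numpy call overhead, so very small chunks will be slow.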
Upvotes: 2