Reputation: 161
I recently signed up for a scientific Python course, and it was shown in class how a numpy array can perform better than a list in some situations. For a statistical simulation I tried the two approaches and, surprisingly, the numpy array takes a lot longer to finish. Could someone help me find my (possible) mistake?
My first thought was that something must be wrong with the way the code is written, but I can't figure out what. The script calculates how many attempts, on average, someone needs to complete a sticker collection:
I used a function and no external modules.
import random as rd
import statistics as st

def collectStickers(experiments, collectible):
    obtained = []
    attempts = 0
    # draw random stickers until the collection is complete
    while len(obtained) < collectible:
        new_sticker = rd.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1  # every draw counts as an attempt
    experiments.append(attempts)

experiments = []
collectible = 20
rep_experiment = 100000

for i in range(1, rep_experiment):  # note: runs rep_experiment - 1 times
    collectStickers(experiments, collectible)

print(st.mean(experiments))
The processing time seems ok for a simple experiment like this one, but for more complex purposes, 13.8 seconds is too much.
72.06983069830699
[Finished in 13.8s]
I could not use a function, as the following warnings showed up when I followed the same logic as above:
RuntimeWarning: Mean of empty slice.
RuntimeWarning: invalid value encountered in double_scalars
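(As far as I can tell, these are the warnings NumPy prints when np.mean is called on an empty array, e.g.:)

import numpy as np

np.mean(np.array([]))  # RuntimeWarning: Mean of empty slice; returns nan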
So I just went for the naive way:
import random as rd
import numpy as np

experiments = np.array([])
rep_experiment = 100000

for i in range(1, rep_experiment):
    obtained = np.array([])
    attempts = 0
    while len(obtained) < 20:
        new_sticker = rd.randint(1, 20)
        if new_sticker not in obtained:
            obtained = np.append(obtained, new_sticker)
        attempts += 1
    experiments = np.append(experiments, attempts)

print(np.mean(experiments))
Almost 4x slower!
Is the difference due to the use of a function?
72.03112031120311
[Finished in 54.2s]
Upvotes: 0
Views: 129
Reputation: 682
To really take advantage of the power of numpy arrays, you need to program in a numpy way: operate on all experiments at once instead of looping over them in Python. For example, you can vectorize the experiment like this:
import numpy as np

def vectorized():
    rep_experiment = 100000
    collectible = 20
    # per-experiment flag: True once that experiment has completed the collection
    obtained = np.zeros(rep_experiment, dtype=bool)
    attempts = np.zeros(rep_experiment, dtype=int)
    # one row per experiment, one column per sticker
    targets = np.zeros((rep_experiment, collectible), dtype=bool)
    x = np.arange(rep_experiment)
    while False in targets:
        # count this draw for every unfinished experiment (incremented before
        # the completion check so the completing draw is included, as in the
        # original simulation)
        attempts[~obtained] += 1
        # draw one random sticker per experiment
        r = np.random.randint(0, collectible, size=rep_experiment)
        # add the new stickers to the collected targets
        targets[x, r] = True
        # if all stickers are collected, set obtained to True
        obtained[np.sum(targets, axis=1) == collectible] = True
    print(attempts.mean(), attempts.std())
Check the speedup; it is around 50x for me.
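For a rough comparison on your own machine, you can time it with the standard library's timeit (a minimal sketch, assuming vectorized from above is defined):

import timeit

# average over 3 runs of the full vectorized simulation
print(timeit.timeit(vectorized, number=3) / 3)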
Upvotes: 4
Reputation: 168834
np.append copies the array before appending to it. Your program is probably spending most of its time doing those unnecessary copies in

experiments = np.append(experiments, attempts)
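You can see the copy semantics in isolation (a minimal illustration, not from the original program):

import numpy as np

a = np.array([1.0, 2.0])
b = np.append(a, 3.0)  # allocates a new array and copies a into it
print(a)  # [1. 2.] -- a is unchanged
print(b)  # [1. 2. 3.]

Appending in a loop therefore copies an ever-growing array on every iteration, which is quadratic in the total number of elements.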
As expected, replacing the quadratic-esque np.append() with a preallocated array makes the wrapper function approximately the same speed as the list version. Replacing the list of obtained stickers with a set makes things a little faster.
However, the bottleneck is the slow random number generator. Running the script through cProfile reveals that 75% of the execution time is spent in randint().
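To reproduce that profile, one option is the standard library's cProfile, applied to the functions defined below (a sketch):

import cProfile

cProfile.run("repeat_with_list(original_collect_stickers)", sort="cumulative")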
Below is the code, followed by the results (on my machine).
import random
import statistics
import timeit

import numpy as np

collectible = 20
rep_experiment = 10000

def original_collect_stickers():
    obtained = []
    attempts = 0
    while len(obtained) < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    return attempts

def set_collect_stickers():
    obtained = set()
    attempts = 0
    n = 0
    while n < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts

def repeat_with_list(fn):
    experiments = []
    for i in range(rep_experiment):
        experiments.append(fn())
    return statistics.mean(experiments)

def repeat_with_numpy(fn):
    experiments = np.zeros(rep_experiment)
    for i in range(rep_experiment):
        experiments[i] = fn()
    return np.mean(experiments)

def time_fn(name, fn, n=3):
    time_taken = timeit.timeit(fn, number=n) / n
    result = fn()  # once more to get the result too
    print(f"{name:15}: {time_taken:.6f}, result {result}")

for wrapper in (repeat_with_list, repeat_with_numpy):
    for fn in (original_collect_stickers, set_collect_stickers):
        time_fn(f"{wrapper.__name__} {fn.__name__}", lambda: wrapper(fn))
The result is
repeat_with_list original_collect_stickers: 0.747183, result 71.7912
repeat_with_list set_collect_stickers: 0.688952, result 72.1002
repeat_with_numpy original_collect_stickers: 0.752644, result 72.0978
repeat_with_numpy set_collect_stickers: 0.685355, result 71.7515
Using the fastrand library's pcg32bounded() generator, i.e. new_sticker = fastrand.pcg32bounded(collectible), makes things plenty fast:
repeat_with_list original_collect_stickers: 0.761186, result 72.0185
repeat_with_list set_collect_stickers: 0.690244, result 71.878
repeat_with_list set_collect_stickers_fastrand: 0.116410, result 71.9323
repeat_with_numpy original_collect_stickers: 0.759154, result 71.8604
repeat_with_numpy set_collect_stickers: 0.696563, result 71.5482
repeat_with_numpy set_collect_stickers_fastrand: 0.114212, result 71.6369
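The set_collect_stickers_fastrand variant isn't shown above; it is just set_collect_stickers with the generator swapped, roughly like this (note that pcg32bounded(n) returns an int in [0, n) rather than [1, n], which doesn't matter here because only the number of distinct values counts):

import fastrand  # third-party: pip install fastrand

def set_collect_stickers_fastrand():
    obtained = set()
    attempts = 0
    n = 0
    while n < collectible:
        new_sticker = fastrand.pcg32bounded(collectible)  # int in [0, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts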
Upvotes: 3