Python list performs better than numpy array?

I recently signed up for a scientific Python course, and in class it was shown how a numpy array can perform better than a list in some situations. For a statistical simulation I tried both approaches, and surprisingly, the numpy array takes a lot longer to finish. Could someone help me find my (possible) mistake?

My first thought was that probably something is wrong with the way the code is written, but I can't figure out how it can be wrong. The script calculates how many attempts, on average, someone needs to complete a sticker collection:

Python list

I used a function and no external modules.

import random as rd
import statistics as st

def collectStickers(experiments, collectible):

    obtained = []
    attempts = 0

    while len(obtained) < collectible:

        new_sticker = rd.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1

    experiments.append(attempts)


experiments = []
collectible = 20
rep_experiment = 100000


for i in range(rep_experiment):
    collectStickers(experiments, collectible)

print(st.mean(experiments))

Results

The processing time seems ok for a simple experiment like this one, but for more complex purposes, 13.8 seconds is too much.

72.06983069830699
[Finished in 13.8s]
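
For what it's worth, the result itself is easy to sanity-check: this is the classic coupon collector problem, where completing a collection of n stickers takes n * H_n draws on average (H_n being the n-th harmonic number). A quick check in plain Python:

n = 20
expected = n * sum(1 / k for k in range(1, n + 1))
print(expected)  # ~71.95, consistent with the simulated 72.07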

Numpy

I could not wrap the code in a function; errors showed up when I followed the same logic as above, so I just went for the naive way:

import random as rd
import numpy as np

experiments = np.array([])
rep_experiment = 100000


for i in range(rep_experiment):
    obtained = np.array([])
    attempts = 0
    while len(obtained) < 20:

        new_sticker = rd.randint(1, 20)
        if new_sticker not in obtained:
            obtained = np.append(obtained, new_sticker)
        attempts += 1

    experiments = np.append(experiments, attempts)

print(np.mean(experiments))

Results

Almost 4x slower!

Is the difference due to the use of a function?

72.03112031120311
[Finished in 54.2s]

Upvotes: 0

Views: 129

Answers (2)

Oily

Reputation: 682

To really take advantage of the power of numpy arrays, you need to program in a numpy way. For example, try to vectorize the whole experiment like this:

import numpy as np

def vectorized():
    rep_experiment = 100000
    collectible = 20

    # one row of sticker flags per experiment
    targets = np.zeros((rep_experiment, collectible), dtype=bool)
    # per-experiment attempt counters
    attempts = np.zeros(rep_experiment, dtype=int)
    # which experiments have already completed their collection
    obtained = np.zeros(rep_experiment, dtype=bool)

    # row indices for the fancy indexing below
    x = np.arange(rep_experiment)

    while not targets.all():
        # draw one sticker per experiment, all at once
        r = np.random.randint(0, collectible, size=rep_experiment)
        # count this draw for the experiments that are still collecting
        attempts[~obtained] += 1
        # add the new stickers to the collected targets
        targets[x, r] = True
        # mark the experiments whose collections are now complete
        obtained[np.sum(targets, axis=1) == collectible] = True

    print(attempts.mean(), attempts.std())

Check the speed-up; it is around 50x for me.
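
To verify the speed-up on your machine, you can time a single call, e.g. (a minimal sketch, assuming the vectorized function above is defined):

import timeit

# average wall time of one vectorized run, in seconds
print(timeit.timeit(vectorized, number=3) / 3)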

Upvotes: 4

AKX

Reputation: 168834

np.append copies the array before appending to it.

Your program is probably spending most of its time doing those unnecessary copies in

experiments = np.append(experiments, attempts)
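
You can see the copying directly; np.append never mutates its argument, it allocates and returns a new array on every call (a minimal demonstration):

import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)  # allocates a new array and copies a into it
print(a)             # [1 2 3] -- the original is unchanged
print(b)             # [1 2 3 4]
print(b is a)        # False

Appending in a loop therefore re-copies all existing elements on every iteration, which is quadratic total work.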

EDIT

As expected, replacing the quadratic-esque np.append() with a preallocated array makes the two wrapper functions approximately the same speed.

Replacing the list of obtained stickers with a set makes things a little faster.

However, the bottleneck is a slow random number generator. Running the code through cProfile reveals that 75% of the execution time is spent in randint().
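
For reference, a profile like that can be reproduced along these lines (a sketch; the function names refer to the listing below):

import cProfile

# sort by total time spent inside each function; randint() and its
# internal helpers dominate the output
cProfile.run("repeat_with_list(original_collect_stickers)", sort="tottime")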

See the code below; the results (on my machine) follow it.

import random
import statistics
import timeit

import numpy as np

collectible = 20
rep_experiment = 10000


def original_collect_stickers():
    obtained = []
    attempts = 0

    while len(obtained) < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    return attempts


def set_collect_stickers():
    obtained = set()
    attempts = 0
    n = 0

    while n < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts


def repeat_with_list(fn):
    experiments = []
    for i in range(rep_experiment):
        experiments.append(fn())
    return statistics.mean(experiments)


def repeat_with_numpy(fn):
    experiments = np.zeros(rep_experiment)
    for i in range(rep_experiment):
        experiments[i] = fn()
    return np.mean(experiments)


def time_fn(name, fn, n=3):
    time_taken = timeit.timeit(fn, number=n) / n
    result = fn()  # once more to get the result too
    print(f"{name:15}: {time_taken:.6f}, result {result}")


for wrapper in (repeat_with_list, repeat_with_numpy):
    for fn in (original_collect_stickers, set_collect_stickers):
        time_fn(f"{wrapper.__name__} {fn.__name__}", lambda: wrapper(fn))

The result is

repeat_with_list original_collect_stickers: 0.747183, result 71.7912
repeat_with_list set_collect_stickers: 0.688952, result 72.1002
repeat_with_numpy original_collect_stickers: 0.752644, result 72.0978
repeat_with_numpy set_collect_stickers: 0.685355, result 71.7515

EDIT 2

Using the fastrand library's pcg32bounded() generator, i.e. new_sticker = fastrand.pcg32bounded(collectible), makes things plenty fast:

repeat_with_list original_collect_stickers: 0.761186, result 72.0185
repeat_with_list set_collect_stickers: 0.690244, result 71.878
repeat_with_list set_collect_stickers_fastrand: 0.116410, result 71.9323
repeat_with_numpy original_collect_stickers: 0.759154, result 71.8604
repeat_with_numpy set_collect_stickers: 0.696563, result 71.5482
repeat_with_numpy set_collect_stickers_fastrand: 0.114212, result 71.6369
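
The fastrand variant itself is not listed above; a minimal sketch of what it could look like, assuming fastrand.pcg32bounded(n) returns an integer in [0, n) (so stickers are numbered from 0 instead of 1, which does not affect the count):

import fastrand

def set_collect_stickers_fastrand():
    obtained = set()
    attempts = 0
    n = 0

    while n < collectible:
        new_sticker = fastrand.pcg32bounded(collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts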

Upvotes: 3
