FLab
FLab

Reputation: 7476

Numpy's partition slower than sort for small arrays

I was looking for an efficient way to calculate the nth largest value in a numpy array and this answer lead me to np.partition.

By the way, I have noticed that the naive sorting is faster than np.partition approach for array shorter than 100 entries. (For large arrays, instead, the gain is evident)

What is the reason why np.partition run time is almost flat for small arrays?

enter image description here

Code to generate picture:

import pandas as pd
import numpy as np

import timeit

def func_1(inp):
    return np.partition(inp, 10)[10]

def func_2(inp):
    return np.sort(inp)[10]

a = []
b = []

N_tests = int(1e5)

for wdw in range(20, 1000, 10):

    print wdw

    res1 = timeit.timeit("func_1(test)",
                      setup = "import pandas as pd; import numpy as np; wdw_size = %d; test = np.random.randn(wdw_size); from __main__ import func_1"%wdw, number = N_tests)

    a.append(res1)

    res2 = timeit.timeit("func_2(test)",
                      setup = "import pandas as pd; import numpy as np; wdw_size = %d; test = np.random.randn(wdw_size); from __main__ import func_2"%wdw, number = N_tests)

    b.append(res2)

import matplotlib.pyplot as plt
plt.plot(range(20,1000, 10), a, range(20, 1000, 10), b)
plt.legend(['np.partition', 'np.sort'])
plt.xlabel('Array Size')
plt.ylabel('Time')

Upvotes: 6

Views: 1429

Answers (1)

Mike JS Choi
Mike JS Choi

Reputation: 1151

According to the docs, np.partition is implemented via Introselect - an algorithm with a worst-case performance of O(n).

In a sentence, Introselect is a souped up version of Quick sort with a little help from median of medians.

On the otherhand, np.sort is implemented using the plain old Quick Sort, which has a worst-case performance of O(n^2).

So to compare the two, while np.sort only uses Quick sort and may end up with a O(n^2) as its worst-case performance, np.partition can avoid this by falling back on median of medians when necessary to always ensure a O(n).


Not entirely sure but maybe np.sort is faster for small arrays just because np.partition has a larger overhead due to its more complex algorithm.

Upvotes: 5

Related Questions