Reputation: 7476
I was looking for an efficient way to calculate the nth largest value in a numpy array, and this answer led me to np.partition.
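For context, this is the kind of usage I mean (a minimal sketch; the helper name `nth_largest` is my own): np.partition moves the element at a given index into its final sorted position without fully sorting the array.

```python
import numpy as np

def nth_largest(arr, n):
    """Return the nth largest value (n=1 is the maximum) via np.partition.

    np.partition places the element at index -n where it would land in a
    full sort, without sorting the rest -- O(len(arr)) on average.
    """
    return np.partition(arr, -n)[-n]

values = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(nth_largest(values, 1))  # → 9
print(nth_largest(values, 3))  # → 5
```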
By the way, I have noticed that naive sorting is faster than the np.partition approach for arrays shorter than about 100 entries. (For larger arrays, on the other hand, the gain is evident.)
Why is the np.partition run time almost flat for small arrays?
Code to generate the picture:
import numpy as np
import timeit

def func_1(inp):
    return np.partition(inp, 10)[10]

def func_2(inp):
    return np.sort(inp)[10]
a = []
b = []
N_tests = int(1e5)
for wdw in range(20, 1000, 10):
    print(wdw)
    res1 = timeit.timeit("func_1(test)",
        setup="import numpy as np; wdw_size = %d; test = np.random.randn(wdw_size); from __main__ import func_1" % wdw,
        number=N_tests)
    a.append(res1)
    res2 = timeit.timeit("func_2(test)",
        setup="import numpy as np; wdw_size = %d; test = np.random.randn(wdw_size); from __main__ import func_2" % wdw,
        number=N_tests)
    b.append(res2)
import matplotlib.pyplot as plt
plt.plot(range(20, 1000, 10), a, range(20, 1000, 10), b)
plt.legend(['np.partition', 'np.sort'])
plt.xlabel('Array Size')
plt.ylabel('Time')
plt.show()
Upvotes: 6
Views: 1429
Reputation: 1151
According to the docs, np.partition
is implemented via Introselect - an algorithm with a worst-case performance of O(n).
In a sentence, Introselect is a souped-up version of Quickselect with a little help from median of medians.
On the other hand, np.sort
is by default implemented using plain old Quicksort, which has a worst-case performance of O(n^2).
So to compare the two: while np.sort
uses only Quicksort and may end up with O(n^2) worst-case performance, np.partition
avoids this by falling back on median of medians when necessary, which always ensures O(n).
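To illustrate the difference, here is a minimal pure-Python quickselect, the randomized core that Introselect builds on (a sketch for intuition, not NumPy's actual implementation): it recurses into only the one side of each partition that contains the target index, which is what makes selection O(n) on average, while a full sort must recurse into both sides.

```python
import random

def quickselect(arr, k):
    """Return the element that would sit at index k if arr were sorted.

    Partitions around a random pivot and recurses into only the side
    containing index k: average O(n), worst case O(n^2). Introselect
    adds a median-of-medians fallback to cap the worst case.
    """
    pivot = random.choice(arr)
    lows = [x for x in arr if x < pivot]     # elements left of the pivot
    pivots = [x for x in arr if x == pivot]  # the pivot's final block
    highs = [x for x in arr if x > pivot]    # elements right of the pivot
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

print(quickselect([3, 1, 4, 1, 5, 9, 2, 6], 3))  # → 3
```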
Not entirely sure, but np.sort
may be faster for small arrays simply because np.partition
has a larger constant overhead due to its more complex algorithm.
Upvotes: 5