Reputation: 97
I am trying to extract the k largest elements from a large array and compare the following two algorithms:

Algorithm 1 (min heap): T(n) = k + n log k
Algorithm 2 (sorting): T(n) = n log n

But for some reason Algorithm 2 is performing better. Can someone please explain why?
class MinHeap:
    def __init__(self, arr):
        self.__items = arr
        self.build_heap()

    def size(self):
        return len(self.__items)

    def items(self):
        return self.__items

    def build_heap(self):
        for node_number in xrange(int(len(self.__items) / 2), 0, -1):
            self.heapify(node_number)

    def heapify(self, node_number):
        # return if leaf node
        if node_number > int(len(self.__items) / 2):
            return
        node = self.__items[node_number - 1]
        left_child = self.__items[(2 * node_number) - 1] if ((2 * node_number) - 1) < len(self.__items) else None
        right_child = self.__items[(2 * node_number + 1) - 1] if ((2 * node_number + 1) - 1) < len(self.__items) else None
        min_node = node
        if left_child is not None and right_child is not None:
            min_node = min(node, left_child, right_child)
        elif left_child is not None:
            min_node = min(node, left_child)
        elif right_child is not None:
            min_node = min(node, right_child)
        if min_node == node:
            return
        elif left_child is not None and min_node == left_child:
            self.__items[node_number - 1], self.__items[(2 * node_number) - 1] = self.__items[(2 * node_number) - 1], self.__items[node_number - 1]
            self.heapify(2 * node_number)
        elif right_child is not None and min_node == right_child:
            self.__items[node_number - 1], self.__items[(2 * node_number + 1) - 1] = self.__items[(2 * node_number + 1) - 1], self.__items[node_number - 1]
            self.heapify(2 * node_number + 1)

    def extract_min(self):
        length = len(self.__items)
        if length == 0:
            return
        self.__items[0], self.__items[length - 1] = self.__items[length - 1], self.__items[0]
        min_element = self.__items.pop()
        self.heapify(1)
        return min_element

    def insert(self, num):
        self.__items.append(num)
        current_node = len(self.__items)
        parent_node = int(current_node / 2)
        while current_node > 1:
            min_node = min(self.__items[current_node - 1], self.__items[parent_node - 1])
            if min_node == self.__items[parent_node - 1]:
                break
            self.__items[current_node - 1], self.__items[parent_node - 1] = self.__items[parent_node - 1], self.__items[current_node - 1]
            current_node = parent_node
            parent_node = int(current_node / 2)
# Comparing Algorithms ::::::::::::::::::
import time
import random
from min_heap import *

numbers = random.sample(range(100000), 50000)
k = 3
n = len(numbers)

# k largest elements using a min heap, T(n) = k + n log k
start = time.time()
my_heap = MinHeap(numbers[:k+1])
for number in numbers[k+1:]:
    my_heap.extract_min()
    my_heap.insert(number)
data = sorted(my_heap.items(), reverse=True)
print data[:len(data)-1]
end = time.time()
print "Took {} seconds".format(end - start)

# k largest elements using sorting, T(n) = n log n
start = time.time()
sorted_arr = sorted(numbers, reverse=True)
print sorted_arr[:k]
end = time.time()
print "Took {} seconds".format(end - start)
This is the output that I am getting:
Algorithm 1
[99999, 99998, 99997]
Took 0.15064406395 seconds
Algorithm 2
[99999, 99998, 99997]
Took 0.0120780467987 seconds
Upvotes: 1
Views: 639
Reputation: 133995
The theoretical bounds are:

heap selection: O(n log k)
quick select: O(n) on average
sorting: O(n log n)
One of the problems with your heap selection method is that you push every item through the heap, whether it belongs there or not. You really only need to create a heap of the first k items and then test each remaining item against it. The algorithm is:
h = create a min heap of first k items   // O(k)
for each remaining item
    if item is larger than smallest item on heap
        remove smallest item from heap
        add new item to heap
The worst case for that algorithm is O(n log k). The idea is that in the worst case you end up doing n insertions and n removals from a heap of size k. In the average case, the number of insertions and removals is quite a bit smaller. It depends on the relationship of k to n. If k is much smaller than n, this method can be very fast.
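A minimal sketch of that algorithm in Python, using the standard library's heapq module rather than your MinHeap class (heapq.heapreplace pops the minimum and pushes the new item in a single O(log k) step):

import heapq

def k_largest(numbers, k):
    heap = numbers[:k]           # first k items
    heapq.heapify(heap)          # O(k) build
    for number in numbers[k:]:   # remaining n - k items
        if number > heap[0]:     # heap[0] is the smallest of the current k
            heapq.heapreplace(heap, number)
    return sorted(heap, reverse=True)

print k_largest([5, 1, 9, 3, 7], 3)  # prints [9, 7, 5]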
In your code, you're creating a min-heap of the first k+1 items, and then you have this loop:
for number in numbers[k+1:]:
    my_heap.extract_min()
    my_heap.insert(number)
That loop does far more work than necessary because you unconditionally remove the minimum element: every remaining number costs you an extract plus an insert, even when the number is smaller than everything already on the heap. (Keeping k+1 items rather than k is what saves the correctness of your final answer here; with a heap of exactly k items, unconditionally replacing the minimum would discard top-k elements.) You need to modify your code:
for number in numbers[k+1:]:
    if number > my_heap.get_min():  # get_min returns the minimum element, but doesn't remove it
        my_heap.extract_min()
        my_heap.insert(number)
That will greatly reduce the number of insertions and removals you have to make in the average case, which will speed up your program.
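(Back-of-envelope, assuming the input arrives in uniformly random order: the i-th item examined belongs in the current top k with probability about k/i, so the expected number of heap updates is roughly the sum of k/i for i from k+1 to n, which is about k·ln(n/k). For k = 3 and n = 50000 that is around 30 updates instead of ~50000 unconditional ones.)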
Your heap implementation is also suboptimal. You'll save considerable time by making your heapify function iterative rather than recursive, and there are ample opportunities to optimize that function further.
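For instance, here is a rough iterative rewrite of heapify (my sketch, meant to drop into the MinHeap class above, keeping its 1-based node numbering):

def heapify(self, node_number):
    # Sift the node down iteratively instead of recursing
    size = len(self.__items)
    while True:
        left = 2 * node_number        # 1-based index of left child
        right = 2 * node_number + 1   # 1-based index of right child
        smallest = node_number
        if left <= size and self.__items[left - 1] < self.__items[smallest - 1]:
            smallest = left
        if right <= size and self.__items[right - 1] < self.__items[smallest - 1]:
            smallest = right
        if smallest == node_number:   # heap property restored
            return
        self.__items[node_number - 1], self.__items[smallest - 1] = self.__items[smallest - 1], self.__items[node_number - 1]
        node_number = smallest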
As I mentioned above, quick select is an O(n) algorithm. In general, quick select is faster when k is more than about 2% of n. So if you're selecting 100 items from a list of 1,000, quick select will almost certainly be faster. If you're selecting 100 from a list of 1,000,000, then the heap selection method should be faster.
Sorting shouldn't ever be faster than quick select, and it shouldn't be faster than heap selection unless you're selecting a very large percentage (more than 60%, I expect) of the items.
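For reference, a rough quick select sketch (my illustration, not from the post; it trades memory for simplicity by partitioning into new lists around a random pivot):

import random

def quickselect_largest(numbers, k):
    # Returns the k largest elements, unordered; O(n) average, O(n^2) worst case
    if k >= len(numbers):
        return list(numbers)
    pivot = random.choice(numbers)
    greater = [x for x in numbers if x > pivot]
    equal = [x for x in numbers if x == pivot]
    if k <= len(greater):
        return quickselect_largest(greater, k)
    if k <= len(greater) + len(equal):
        return greater + equal[:k - len(greater)]
    lesser = [x for x in numbers if x < pivot]
    return greater + equal + quickselect_largest(lesser, k - len(greater) - len(equal))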
I did some fairly detailed analysis of this a few years back. See my blog post, When theory meets practice.
Upvotes: 1
Reputation: 7552
There are two parts to the answer:
Practical aspect:
Calling a user-defined function is often slow, growing lists requires re-allocation of memory, and so on. None of that applies to the sort() + [:k] approach, which runs almost entirely in C. Python code can be slow if the author doesn't pay attention to the details of memory management. So you're measuring differences in implementation AND differences in the algorithm.
Theoretical aspect:
Your T(n) = k + n log k is actually the right bound for the min heap method: I see you calling my_heap.insert() and my_heap.extract_min() (triggering heapify()) almost n times each, but since the heap never holds more than k+1 items, each call is O(log k), not O(log n). The catch is that with k = 3, log k is a tiny constant, so the asymptotic advantage over n log n is swamped by the cost of roughly 2n heap operations in interpreted Python versus a single sorted() call that runs in C.
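Rough numbers (my arithmetic, not from the post): with n = 50000 and k = 3, n·log2(k) is about 50000 × 1.6 ≈ 80,000 comparisons, versus n·log2(n) ≈ 50000 × 15.6 ≈ 780,000 for the sort. The sort does roughly ten times as many comparisons and still wins, which is exactly the implementation effect described above.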
Here's the output of a profiler run:
$ python -m cProfile -s cumtime h.py | head -n 20
[99999, 99998, 99997]
Took 0.255105018616 seconds
[99999, 99998, 99997]
Took 0.0103080272675 seconds
800007 function calls (750010 primitive calls) in 0.291 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.022 0.022 0.291 0.291 h.py:1(<module>)
49996 0.034 0.000 0.144 0.000 h.py:47(extract_min)
99995/49998 0.088 0.000 0.103 0.000 h.py:17(heapify)
49996 0.076 0.000 0.090 0.000 h.py:56(insert)
1 0.018 0.018 0.021 0.021 random.py:293(sample)
149975 0.015 0.000 0.015 0.000 {min}
299987 0.014 0.000 0.014 0.000 {len}
2 0.010 0.005 0.010 0.005 {sorted}
49996 0.004 0.000 0.004 0.000 {method 'pop' of 'list' objects}
49996 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
50000 0.003 0.000 0.003 0.000 {method 'random' of '_random.Random' objects}
Upvotes: 0