Aakash

Reputation: 97

K-Largest elements Algorithm Comparison

I am trying to extract the k largest elements from a large array and compare the following two algorithms:

  1. Min-heap method: T(n) = k + n log k
  2. Sorting method: T(n) = n log n

But for some reason Algorithm 2 is performing better. Can someone please explain why?

import math


class MinHeap:
    def __init__(self, arr):
        self.__items = arr
        self.build_heap()

    def size(self):
        return len(self.__items)
    def items(self):
        return self.__items

    def build_heap(self):
        for node_number in xrange(int(len(self.__items) / 2), 0, -1):
            self.heapify(node_number)

    def heapify(self, node_number):
        # return if leaf node
        if node_number > int(len(self.__items) / 2):
            return

        node = self.__items[node_number - 1]
        left_child = self.__items[(2 * node_number) - 1] if ((2 * node_number) - 1) < len(self.__items) else None
        right_child = self.__items[(2 * node_number + 1) - 1] if ((2 * node_number + 1) - 1) < len(self.__items) else None

        min_node = node

        if left_child is not None and right_child is not None:
            min_node = min(node, left_child, right_child)
        elif left_child is not None:
            min_node = min(node, left_child)
        elif right_child is not None:
            min_node = min(node, right_child)

        if min_node == node:
            return
        elif left_child is not None and min_node == left_child:
            self.__items[node_number - 1], self.__items[(2 * node_number) - 1] = self.__items[(2 * node_number) - 1], self.__items[node_number - 1]
            self.heapify(2 * node_number)
        elif right_child is not None and min_node == right_child:
            self.__items[node_number - 1], self.__items[(2 * node_number + 1) - 1] = self.__items[(2 * node_number + 1) - 1], self.__items[node_number - 1]
            self.heapify(2 * node_number + 1)

    def extract_min(self):
        length = len(self.__items)
        if length == 0:
            return
        self.__items[0], self.__items[length - 1] = self.__items[length - 1], self.__items[0]
        min_element = self.__items.pop()
        self.heapify(1)
        return min_element

    def insert(self, num):
        self.__items.append(num)
        current_node = len(self.__items)
        parent_node = int(current_node / 2)
        while current_node > 1:
            min_node = min(self.__items[current_node - 1], self.__items[parent_node - 1])
            if min_node == self.__items[parent_node - 1]:
                break
            self.__items[current_node - 1], self.__items[parent_node - 1] = self.__items[parent_node - 1], self.__items[current_node - 1]
            current_node = parent_node
            parent_node = int(current_node / 2)

# Comparing Algorithms ::::::::::::::::::
import time
import random
from min_heap import *

numbers = random.sample(range(100000), 50000)

k = 3
n = len(numbers)

# k Largest element using Heap T(n) = k + nlogk
start = time.time()
my_heap = MinHeap(numbers[:k+1])

for number in numbers[k+1:]:
    my_heap.extract_min()
    my_heap.insert(number)

data = sorted(my_heap.items(), reverse=True)
print data[:len(data)-1]
end = time.time()
print "Took {} seconds".format(end-start)


# k Largest element using sorting T(n) = nlogn
start = time.time()
sorted_arr = sorted(numbers, reverse=True)
print sorted_arr[:k]
end = time.time()
print "Took {} seconds".format(end-start)

This is the output that I am getting:

Algorithm 1

[99999, 99998, 99997]
Took 0.15064406395 seconds

Algorithm 2

[99999, 99998, 99997]
Took 0.0120780467987 seconds

Upvotes: 1

Views: 639

Answers (2)

Jim Mischel

Reputation: 133995

The theoretical bounds are:

  • Sorting: O(n log n)
  • Heap selection: O(n log k)
  • Quick select: O(n)

A common problem with heap selection is building a heap of the entire set of items when you really only need a heap of the first k items. The algorithm is:

h = create a min heap of first k items // O(k)
for each remaining item
    if item is larger than smallest item on heap
        remove smallest item from heap
        add new item to heap

The worst case for that algorithm is O(n log k). The idea is that in the worst case you end up doing n insertions and n removals from a heap of size k. In the average case, the number of insertions and removals is quite a bit smaller. It depends on the relationship of k to n. If k is much smaller than n, this method can be very fast.
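
For reference, here's a minimal sketch of that algorithm using Python's standard heapq module (the function name k_largest is just for illustration, not something from your code):

import heapq

def k_largest(numbers, k):
    # Build a min-heap from the first k items: O(k)
    heap = numbers[:k]
    heapq.heapify(heap)
    # Touch the heap only when an item beats the current smallest,
    # so the worst case is O(n log k)
    for number in numbers[k:]:
        if number > heap[0]:
            heapq.heapreplace(heap, number)  # pop the minimum, push the new item
    return sorted(heap, reverse=True)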

In your code, you're creating a min-heap of the first k+1 items, and then you have this loop:

for number in numbers[k+1:]:
    my_heap.extract_min()
    my_heap.insert(number)

That code produces the correct result here only because you keep k+1 elements and always drop the current minimum before inserting the next number. But it does far more heap work than necessary: even when a new number can't possibly be among the k largest, you still remove the minimum and insert the new number, replacing a larger element with a smaller one. You should modify your code:

for number in numbers[k+1:]:
    if number > my_heap.get_min():  # get_min returns the minimum element, but doesn't remove it
        my_heap.extract_min()
        my_heap.insert(number)
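
(Your MinHeap class as posted doesn't have a get_min method; a one-line peek along these lines, added inside the class, would do. This is a sketch, not part of your original code.)

def get_min(self):
    # Peek at the smallest element (the root) without removing it;
    # assumes the heap is non-empty.
    return self.__items[0]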

That will drastically reduce the number of insertions and removals you have to make in the average case, which will speed up your program.

Your heap implementation is also suboptimal. You'll save considerable time by making your heapify function iterative rather than recursive, and there are ample opportunities to optimize that function further.
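
For example, an iterative sift-down might look roughly like this (a sketch that operates directly on the backing list and uses the same 1-based node numbering as your code):

def sift_down(items, node_number):
    # Iteratively push the value at node_number down until the min-heap
    # property holds; equivalent to the recursive heapify, without call overhead.
    n = len(items)
    while True:
        left = 2 * node_number
        right = left + 1
        smallest = node_number
        if left <= n and items[left - 1] < items[smallest - 1]:
            smallest = left
        if right <= n and items[right - 1] < items[smallest - 1]:
            smallest = right
        if smallest == node_number:
            return
        items[node_number - 1], items[smallest - 1] = items[smallest - 1], items[node_number - 1]
        node_number = smallest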

As I mentioned above, quick select is an O(n) algorithm. In general, quick select is faster when k is more than about 2% of n. So if you're selecting 100 items from a list of 1,000, quick select will almost certainly be faster. If you're selecting 100 from a list of 1,000,000, then the heap selection method should be faster.

Sorting shouldn't ever be faster than quick select, and it shouldn't be faster than heap selection unless you're selecting a very large percentage (more than 60%, I expect) of the items.
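
If you want to try quick select, here's a rough sketch (random pivot, average O(n), worst case O(n^2); the function name is only illustrative):

import random

def quickselect_k_largest(items, k):
    # Return the k largest items, in no particular order.
    if k <= 0:
        return []
    if k >= len(items):
        return list(items)
    pivot = random.choice(items)
    larger = [x for x in items if x > pivot]   # strictly greater than the pivot
    equal = [x for x in items if x == pivot]
    if k <= len(larger):
        return quickselect_k_largest(larger, k)
    if k <= len(larger) + len(equal):
        return larger + equal[:k - len(larger)]
    smaller = [x for x in items if x < pivot]
    return larger + equal + quickselect_k_largest(smaller, k - len(larger) - len(equal))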

I did some fairly detailed analysis of this a few years back. See my blog post, When theory meets practice.

Upvotes: 1

Pavel

Reputation: 7552

There are two parts to the answer:

  1. Practical aspect:

    Calling user-defined functions is often slow, growing lists requires re-allocating memory, and so on. None of that applies to the sorted() + [:k] approach, which runs almost entirely in C. Python code can be slow if the author doesn't pay attention to these details, so you're measuring differences in implementation AND differences in the algorithm (see the timing sketch after this list).

  2. Theoretical aspect:

    I don't know how you came up with T(n) = k + n log k for the min-heap method, but I see you calling my_heap.insert() and my_heap.extract_min() (which triggers heapify()) unconditionally, almost n times each. Each of those calls is a logarithmic heap operation plus Python-level function-call overhead, so in practice the cost is much higher than that formula suggests.
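
To separate the implementation effect from the algorithmic one, you can compare the C-implemented heapq.nlargest (which does heap selection) against sorted() on the same data. A rough sketch; timings will of course vary by machine:

import heapq
import random
import time

numbers = random.sample(range(100000), 50000)
k = 3

start = time.time()
top_heap = heapq.nlargest(k, numbers)           # heap selection, implemented in C
print("nlargest:   {} in {} seconds".format(top_heap, time.time() - start))

start = time.time()
top_sort = sorted(numbers, reverse=True)[:k]    # full sort, also implemented in C
print("sort+slice: {} in {} seconds".format(top_sort, time.time() - start))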


Here's the output of a profiler run:

$ python -m cProfile -s cumtime h.py | head -n 20
[99999, 99998, 99997]
Took 0.255105018616 seconds
[99999, 99998, 99997]
Took 0.0103080272675 seconds
         800007 function calls (750010 primitive calls) in 0.291 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.022    0.022    0.291    0.291 h.py:1(<module>)
    49996    0.034    0.000    0.144    0.000 h.py:47(extract_min)
99995/49998  0.088    0.000    0.103    0.000 h.py:17(heapify)
    49996    0.076    0.000    0.090    0.000 h.py:56(insert)
        1    0.018    0.018    0.021    0.021 random.py:293(sample)
   149975    0.015    0.000    0.015    0.000 {min}
   299987    0.014    0.000    0.014    0.000 {len}
        2    0.010    0.005    0.010    0.005 {sorted}
    49996    0.004    0.000    0.004    0.000 {method 'pop' of 'list' objects}
    49996    0.003    0.000    0.003    0.000 {method 'append' of 'list' objects}
    50000    0.003    0.000    0.003    0.000 {method 'random' of '_random.Random' objects}

Upvotes: 0
