MoonFruit

Reputation: 1660

Which algorithm should I use to get the biggest n elements from a large list?

In my project, there is a very large list. The most common operation on this list is getting the biggest n elements. The value of n is fixed, or changes only rarely, over the whole lifetime of the list. Which algorithm should I use in order to do this efficiently?

In other words: what should I do when inserting, updating, or deleting an element in the list, and what should I do when getting the top n elements from it?

There is a solution (maybe not that good):

  1. After inserting, updating, or deleting an element, sort the list with quicksort or another sorting algorithm. Because the list is very large, this step may be too slow.
  2. When getting the top n elements, take the first n elements of the list (a sketch of this approach follows below).
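
A minimal sketch of that approach in C++, assuming the list is a std::vector<int> (the function names here are just illustrative):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// step 1: after any insert/update/delete, re-sort the whole list
// in descending order; this costs O(N log N) every time the list changes
void resortDescending(std::vector<int>& list) {
    std::sort(list.begin(), list.end(), std::greater<int>());
}

// step 2: the top n elements are then simply the first n
std::vector<int> topN(const std::vector<int>& list, std::size_t n) {
    n = std::min(n, list.size());
    return std::vector<int>(list.begin(), list.begin() + n);
}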

Is there any better solution?

Upvotes: 2

Views: 2015

Answers (4)

PilouPili

Reputation: 2699

In the C++ STL, your best bet is to use an std::set.

Every time you add an element, the set keeps itself ordered. You can then extract the last n elements of the std::set.
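
For example, a sketch (using std::multiset so duplicate values survive; the function name topN is illustrative):

#include <cstddef>
#include <set>
#include <vector>

// collect the n largest elements, biggest first
std::vector<int> topN(const std::multiset<int>& s, std::size_t n) {
    std::vector<int> result;
    // a multiset iterates in sorted order, so rbegin() yields the largest first
    auto it = s.rbegin();
    for (std::size_t i = 0; i < n && it != s.rend(); ++i, ++it)
        result.push_back(*it);
    return result;
}

Insertion is s.insert(x) and removing one copy is s.erase(s.find(x)), each O(log N), so the container stays ordered across updates.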

Upvotes: 0

Jim Mischel

Reputation: 134005

So you have a list of n items, and you want to pick the k largest. One way to do this is with a min-heap of size k. The resulting algorithm is O(n log k).

Start by creating a min-heap and adding the first k items from the list to it. Then, for each following item in the list, if it's larger than the smallest item on the heap, remove the smallest item from the heap and replace it with the new item. When you're done, the largest k items will be on the heap. Pseudo code looks like this:

// assume an array a[], with length n.
// k is the number of largest items you want.
heap = new min-heap

// add first k items to the heap
for (i = 0; i < k; ++i)
    heap.add(a[i])
for (i = k; i < n; ++i)
    if (a[i] > heap.peek())
        heap.removeMin()
        heap.add(a[i])
// at this point, the largest k items are on the min-heap
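
As a concrete version of that pseudo code, here is a runnable C++ sketch using std::priority_queue with std::greater as the comparator to get a min-heap (the function name topK is illustrative):

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

std::vector<int> topK(const std::vector<int>& a, std::size_t k) {
    // std::greater turns priority_queue into a min-heap: top() is the smallest kept item
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (heap.size() < k)
            heap.push(a[i]);                 // fill the heap with the first k items
        else if (a[i] > heap.top()) {
            heap.pop();                      // evict the current minimum
            heap.push(a[i]);                 // and keep the larger item
        }
    }
    // drain the heap; the result comes out in ascending order
    std::vector<int> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}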

This technique works well when k is a small percentage of n. In that case, it requires little memory. The algorithm has a worst-case running time of O(n log k), but the actual running time depends heavily on the order of items in the list. The worst case is an array sorted in ascending order; the best case is an array sorted in descending order. In the average case, far fewer than 50% of the items are ever added to or removed from the heap.

Another algorithm, Quickselect, has O(n) complexity, but it is slower than the heap selection method when k is a small percentage (1 or 2%) of n. Quickselect also modifies the existing list, which might not be something you want.
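
In C++, std::nth_element provides a quickselect-style partition from the standard library; a sketch (the parameter is taken by value precisely because the algorithm reorders it):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

std::vector<int> topKQuickselect(std::vector<int> a, std::size_t k) {
    k = std::min(k, a.size());
    // partition so the k largest elements occupy a[0..k), in no particular order
    std::nth_element(a.begin(), a.begin() + k, a.end(), std::greater<int>());
    a.resize(k);
    return a;
}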

See my blog post, https://blog.mischel.com/2011/10/25/when-theory-meets-practice/, for more details.

You can do a few things here to speed up your average time by maintaining a heap rather than rebuilding it for every query.

  1. If the number of items you want will always be less than 30, then maintain a heap of 30 items all the time. If the user just wants the top 10, then you can pick just those from the heap.
  2. When an item is added to the list, check to see if it is larger than the smallest item on the heap. If it is, replace the smallest item.
  3. When an item is deleted, mark the heap as dirty.
  4. When asked for the top k items, if the heap is dirty then you have to rebuild it. Otherwise, you can just copy the contents of the heap to a scratch array, sort it, and return the k items that were asked for. Of course, once you rebuild the heap, clear the dirty flag.

The result, then, is that you can maintain the heap at little cost: potentially updating it whenever a new item is added, but only if it is larger than one of the top 30 (or whatever your max is) items. The only time you have to rebuild is when asked for the top k items after a deletion.

Come to think of it, you only have to mark the heap as dirty if the item you delete is greater than or equal to the smallest item on the heap. Also, once the heap is marked as dirty, you can forgo any further updates on insertion or deletion, because you have to rebuild the heap anyway the next time you get a query.
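
Putting all of that together, one possible C++ sketch of the maintained heap with a dirty flag (the class and member names are illustrative, MAX_K plays the role of the 30-item cap, and storage of the main list itself is left out):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

class TopKCache {
    static constexpr std::size_t MAX_K = 30; // upper bound on any query's k
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    bool dirty = false;

public:
    void onInsert(int x) {
        if (dirty) return;                   // a rebuild is pending anyway
        if (heap.size() < MAX_K)
            heap.push(x);
        else if (x > heap.top()) {
            heap.pop();                      // replace the smallest item
            heap.push(x);
        }
    }

    void onDelete(int x) {
        // only a deleted item that might be on the heap invalidates it
        if (!dirty && !heap.empty() && x >= heap.top())
            dirty = true;
    }

    // k is assumed to be <= MAX_K
    std::vector<int> topK(const std::vector<int>& list, std::size_t k) {
        if (dirty) {
            rebuild(list);
            dirty = false;                   // clear the flag after a rebuild
        }
        // copy the heap to a scratch array, sort descending, return k items
        auto copy = heap;
        std::vector<int> scratch;
        while (!copy.empty()) {
            scratch.push_back(copy.top());
            copy.pop();
        }
        std::sort(scratch.begin(), scratch.end(), std::greater<int>());
        if (scratch.size() > k) scratch.resize(k);
        return scratch;
    }

private:
    void rebuild(const std::vector<int>& list) {
        heap = decltype(heap)();             // discard the stale heap
        for (int x : list) {
            if (heap.size() < MAX_K)
                heap.push(x);
            else if (x > heap.top()) {
                heap.pop();
                heap.push(x);
            }
        }
    }
};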

Upvotes: 6

UmNyobe

Reputation: 22890

If n is much smaller than size(list), then use a hashtable for the main elements and a companion data structure to store the n biggest elements. The companion structure is updated during insertion and deletion, and it is used to answer queries for the biggest elements. If n is 30, a sorted array is sufficient.

Disclaimer: this approach performs poorly if the biggest elements are often removed, since deleting one of the biggest elements requires a sequential scan of the whole hashtable.
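
A sketch of one way to wire this up in C++ (the type and member names are illustrative, and refill() is the slow path the disclaimer warns about):

#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

struct TopTracker {
    std::unordered_multiset<int> all;        // main elements: O(1) average insert/erase
    std::vector<int> top;                    // companion: the n biggest, kept sorted ascending
    std::size_t n;

    explicit TopTracker(std::size_t n) : n(n) {}

    void insert(int x) {
        all.insert(x);
        if (top.size() < n) {
            top.insert(std::lower_bound(top.begin(), top.end(), x), x);
        } else if (n > 0 && x > top.front()) {
            top.erase(top.begin());          // drop the current smallest of the top n
            top.insert(std::lower_bound(top.begin(), top.end(), x), x);
        }
    }

    void erase(int x) {
        auto it = all.find(x);
        if (it == all.end()) return;
        all.erase(it);                       // erase one copy from the hashtable
        auto t = std::lower_bound(top.begin(), top.end(), x);
        if (t != top.end() && *t == x) {     // x was one of the biggest elements
            top.erase(t);
            refill();
        }
    }

private:
    void refill() {
        // the expensive case: rebuild the companion by scanning the whole hashtable
        top.assign(all.begin(), all.end());
        std::sort(top.begin(), top.end());
        if (top.size() > n)
            top.erase(top.begin(), top.end() - n);
    }
};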

Upvotes: 0

user1196549

Reputation:

A (balanced) binary search tree is your best friend: insertions, deletions, and search for the k-th element all run in O(log N) time (for the k-th search, the tree needs to be augmented with subtree sizes).


If the data resides in external memory, then use a B-tree or similar.
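
For in-memory data, one concrete (GCC-specific) way to get the O(log N) k-th search is the policy-tree extension, a red-black tree augmented with order statistics; a sketch, assuming distinct keys and 1 <= k <= size:

#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

using ordered_set = __gnu_pbds::tree<
    int, __gnu_pbds::null_type, std::less<int>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

// the k-th largest element, 1-indexed, in O(log N);
// the biggest n elements are the range [find_by_order(size - n), end())
int kthLargest(const ordered_set& s, std::size_t k) {
    return *s.find_by_order(s.size() - k);
}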

Upvotes: 1
