Reputation: 1660
In my project, there is a very large list.
The most common operation on this list is to get the biggest n elements. The value of n is fixed, or rarely changes, over the whole lifetime of the program. Which algorithm should I use in order to do this efficiently?
That is: what should I do when inserting, updating, or deleting an element in the list, and what should I do when getting the top n elements from the list?
There is a solution (maybe not that good): quicksort or another sorting algorithm. But because the list is so large, this step may be too slow. Is there any better solution?
Upvotes: 2
Views: 2015
Reputation: 2699
In the C++ STL, your best bet is to use a std::set. Every time you add an element, the set stays ordered. Then you can extract the n last elements of the std::set.
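A minimal sketch of that idea (one caveat: std::set rejects duplicate values, so std::multiset is used here; the helper name largestN is mine):

#include <cstddef>
#include <set>
#include <vector>

// Walk the multiset backward: the n largest are simply its last n elements.
std::vector<int> largestN(const std::multiset<int>& s, std::size_t n) {
    std::vector<int> out;
    for (auto it = s.rbegin(); it != s.rend() && out.size() < n; ++it)
        out.push_back(*it);   // largest element first
    return out;
}

Each insert into the multiset is O(log size), and reading the top n off the back is O(n).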
Upvotes: 0
Reputation: 134005
So you have a list of n items, and you want to pick the k largest. One way to do this is with a min-heap of size k. The resulting algorithm is O(n log k).
Start by creating a min-heap of the first k items. Then, for each following item in the list, if it's larger than the smallest item on the heap, remove the smallest item on the heap and replace it with the new item. When you're done, the largest k items will be on the heap. In C++, the code looks like this:
#include <functional>
#include <queue>
#include <vector>

using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;

// a is the list (assume a.size() >= k); k is the number of largest items you want
MinHeap largestK(const std::vector<int>& a, std::size_t k) {
    MinHeap heap;
    // add the first k items to the heap
    for (std::size_t i = 0; i < k; ++i)
        heap.push(a[i]);
    for (std::size_t i = k; i < a.size(); ++i) {
        if (a[i] > heap.top()) {   // larger than the smallest item on the heap?
            heap.pop();            // remove the smallest item
            heap.push(a[i]);       // and replace it with the new item
        }
    }
    // at this point, the largest k items are on the min-heap
    return heap;
}
This technique works well when k is a small percentage of n; in that case it requires little memory. The algorithm has a worst-case running time of O(n log k), but actual performance is highly dependent on the order of items in the list. The worst case is when the array is sorted in ascending order; the best case is when it is sorted in descending order. In the average case, far fewer than 50% of the items get added to and removed from the heap.
Another algorithm, Quickselect, has expected complexity O(n), but it is slower than the heap selection method when k is a small percentage (1 or 2%) of n. Quickselect also modifies the existing list, which might not be something you want.
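For reference, the C++ standard library already ships a selection algorithm: std::nth_element, typically implemented as introselect (a quickselect variant). A sketch, assuming integer elements (the function name is mine):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Partition a so that its k largest elements occupy a[0..k-1], in
// unspecified order. Expected O(n), and it reorders a in place.
void moveKLargestToFront(std::vector<int>& a, std::size_t k) {
    if (k >= a.size()) return;   // everything already qualifies
    std::nth_element(a.begin(), a.begin() + k, a.end(), std::greater<int>());
}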
See my blog post, https://blog.mischel.com/2011/10/25/when-theory-meets-practice/, for more details.
You can do a few things here to speed up your average time by maintaining a heap rather than rebuilding it for every query.
The result, then, is that you can maintain the heap at little cost: potentially updating it whenever a new item is added, but only if it is larger than one of the top 30 (or whatever your max is) items. The only time you have to rebuild is when asked for the top k items after a deletion.
Come to think of it, you only have to mark the heap as dirty if the item you delete is greater than or equal to the smallest item on the heap. Also, once the heap is marked dirty, you can forgo any further updates on insertion or deletion, because you'll have to rebuild the heap anyway the next time you get a query.
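A rough sketch of that bookkeeping (all names are mine, assuming integer elements and a plain vector as the backing list):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

class TopK {
public:
    explicit TopK(std::size_t k) : k_(k) {}

    void insert(int x) {
        items_.push_back(x);
        if (dirty_) return;                    // will rebuild on next query anyway
        if (heap_.size() < k_) {
            push(x);
        } else if (x > heap_.front()) {        // larger than the smallest of the top k
            std::pop_heap(heap_.begin(), heap_.end(), std::greater<int>());
            heap_.pop_back();
            push(x);
        }
    }

    void erase(int x) {
        auto it = std::find(items_.begin(), items_.end(), x);
        if (it == items_.end()) return;
        items_.erase(it);
        // only a deletion that could touch the top k invalidates the heap
        if (!dirty_ && !heap_.empty() && x >= heap_.front()) dirty_ = true;
    }

    std::vector<int> topK() {                  // min-heap order; front is smallest
        if (dirty_) {                          // rebuild from scratch: O(n log k)
            heap_.clear();
            for (int v : items_) {
                if (heap_.size() < k_) {
                    push(v);
                } else if (v > heap_.front()) {
                    std::pop_heap(heap_.begin(), heap_.end(), std::greater<int>());
                    heap_.pop_back();
                    push(v);
                }
            }
            dirty_ = false;
        }
        return heap_;
    }

private:
    void push(int v) {
        heap_.push_back(v);
        std::push_heap(heap_.begin(), heap_.end(), std::greater<int>());
    }

    std::size_t k_;
    std::vector<int> items_;                   // the full list
    std::vector<int> heap_;                    // min-heap of the current top k
    bool dirty_ = true;                        // heap needs rebuilding?
};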
Upvotes: 6
Reputation: 22890
If n << size(list), then use a hash table for the main elements and a companion data structure to store the n biggest elements. The companion structure is updated during insertion and deletion, and it's the one used to answer queries for the biggest elements. If n is 30, a sorted array is sufficient.
Disclaimer: this approach performs poorly if the biggest elements are often removed, because deleting one of them requires a sequential scan of the whole hash table to find its replacement.
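A sketch of that layout (names are mine; std::unordered_multiset stands in for the hash table, and a sorted std::vector for the companion structure):

#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

class TopTracker {
public:
    explicit TopTracker(std::size_t n) : n_(n) {}

    void insert(int x) {
        all_.insert(x);
        bump(x);
    }

    void erase(int x) {
        auto it = all_.find(x);
        if (it == all_.end()) return;
        all_.erase(it);
        auto b = std::lower_bound(biggest_.begin(), biggest_.end(), x);
        if (b != biggest_.end() && *b == x) {
            // one of the biggest was removed: rebuild with a full scan of
            // the hash table -- the slow case the disclaimer warns about
            biggest_.clear();
            for (int v : all_) bump(v);
        }
    }

    const std::vector<int>& biggest() const { return biggest_; }  // ascending

private:
    void bump(int v) {   // keep biggest_ as the sorted top n_
        if (biggest_.size() < n_) {
            biggest_.insert(std::lower_bound(biggest_.begin(), biggest_.end(), v), v);
        } else if (v > biggest_.front()) {
            biggest_.insert(std::lower_bound(biggest_.begin(), biggest_.end(), v), v);
            biggest_.erase(biggest_.begin());  // drop the smallest of the top n_
        }
    }

    std::size_t n_;
    std::unordered_multiset<int> all_;
    std::vector<int> biggest_;
};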
Upvotes: 0
Reputation:
A balanced binary search tree, augmented with subtree sizes (an order-statistics tree), is your best friend: insertions, deletions, and finding the k-th element all take O(log N).
If the data resides in external memory, use a B-tree or similar.
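One concrete (GCC-specific) way to get such an augmented tree is the __gnu_pbds policy tree, which maintains subtree sizes so order-statistics queries run in O(log N). A sketch, assuming integer values; pairing each value with a unique id is my workaround to allow duplicates, since policy trees require distinct keys:

#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>
#include <utility>

using ordered_set = __gnu_pbds::tree<
    std::pair<int, long>, __gnu_pbds::null_type,
    std::less<std::pair<int, long>>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

int kthLargest(ordered_set& s, std::size_t k) {
    // find_by_order(i) returns an iterator to the i-th smallest (0-indexed),
    // so the k-th largest (1-indexed) lives at index size() - k. O(log N).
    return s.find_by_order(s.size() - k)->first;
}

Insert with s.insert({value, nextId++}); the k largest are then the elements at orders size()-k through size()-1.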
Upvotes: 1