Reputation: 25812
OK, let's say we have 100 billion items that need to be sorted, and our memory is big enough to hold them all. Can we still use List.sort (merge sort) to sort them?
My concerns have two parts:

1. Merge sort is not in place, so it needs extra memory during sorting. Does that become a concern in this case?
2. The runtime has to allocate and garbage-collect cells for 100 billion items during sorting; will this become a disadvantage in terms of performance?

For sorting 100 billion items, should I use an array instead in this case?
Upvotes: 1
Views: 196
Reputation: 31459
The standard mergesort implementation is clever in not allocating too much memory (the split-in-half at the beginning allocates no new memory). Given an input list of n conses, it will allocate n * log(n) list conses in the worst case (with an essentially identical best case). Given that the values of the elements themselves will be shared between the input, intermediary and output lists, you will only allocate 3 words per list cons, which means that the sort will allocate 3 * n * log(n) words of memory in total (for n = 100 billion, 3 * log2(n) is about 110, which is quite a large constant factor).
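To see why the split phase allocates nothing, here is a minimal sketch of a merge sort that splits by element count rather than by building two half lists. This is not the actual stdlib implementation (which is more optimized and avoids deep non-tail recursion), just an illustration of the idea:

```ocaml
(* Splitting by count: the "right half" is just (drop n1 l), which
   walks the existing spine and allocates no new cons cells. *)
let rec drop n l =
  if n = 0 then l
  else match l with [] -> [] | _ :: t -> drop (n - 1) t

(* Merge allocates one fresh cons per element of the output; note
   this version is not tail-recursive, unlike the real stdlib one. *)
let rec merge cmp l1 l2 =
  match l1, l2 with
  | [], l | l, [] -> l
  | h1 :: t1, h2 :: t2 ->
      if cmp h1 h2 <= 0 then h1 :: merge cmp t1 l2
      else h2 :: merge cmp l1 t2

(* sort_prefix cmp n l sorts the first n elements of l
   (precondition: n <= List.length l). *)
let rec sort_prefix cmp n l =
  match n, l with
  | 0, _ -> []
  | 1, x :: _ -> [x]
  | _ ->
      let n1 = n / 2 in
      let left = sort_prefix cmp n1 l in
      let right = sort_prefix cmp (n - n1) (drop n1 l) in
      merge cmp left right

let my_sort cmp l = sort_prefix cmp (List.length l) l
```

Only the merge phases allocate, which is where the n * log(n) cons count in the analysis above comes from.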
On the other hand, garbage collection can collect some of that memory: the worst-case memory usage is the total live memory, not the total allocated memory. In fact, the intermediary lists built during the log(n) layers of recursive subcalls can be collected before any result is returned (they become dead at the same rate the final merge allocates new cells), so this algorithm keeps n additional live cons cells in the worst case, which means only 3*n words or 24*n bytes. For n = 100 billion, that means 2.4 additional terabytes of memory, just as much as you needed to store the spine of the input list in the first place.
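The figures above can be checked with a quick back-of-the-envelope computation, assuming a 64-bit runtime (1 word = 8 bytes) and a cons cell of 3 words (header plus two fields):

```ocaml
(* Back-of-the-envelope check of the memory figures above. *)
let n = 100e9                                (* 100 billion elements *)
let log2 x = log x /. log 2.
let words_per_cons = 3.

(* Total words allocated over the whole sort: 3 * n * log2(n),
   i.e. roughly 110 * n words for n = 100 billion. *)
let total_allocated_words = words_per_cons *. n *. log2 n

(* Peak extra *live* memory: 3 * n words = 24 * n bytes,
   i.e. about 2.4e12 bytes (2.4 TB) for n = 100 billion. *)
let peak_extra_bytes = words_per_cons *. n *. 8.
```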
Finally, if you keep no reference to the input list itself, you can collect its first half as soon as it is sorted, giving you an n/2 worst-case bound instead of n. And you can collect the first half of that first half while you sort the first half, giving you an n/4 worst-case bound instead of n/2. Taking this reasoning to the limit, I believe that with enough GC work you can in fact sort the list entirely in place, modulo some constant-size memory pool for a stop&copy first-generation GC, whose size will impact the time performance of the algorithm.
Upvotes: 2