Reputation: 379
I tried answering this using external sort, but the interviewer replied that the complexity was too high: n*n*log(n), i.e. n^2 * log(n). Is there a better alternative?
To simplify the question: let's suppose we have 1000 elements to sort, with space allocated for 100 elements only. What is the best algorithm, one that will take less time than external sort?
Upvotes: 1
Views: 5055
Reputation: 27277
I don't know which external sort you (or the interviewer) meant, but
my suggestion is a 10-way (in your case) merge:
- Split the input into max_mem-sized chunks (100 elements here) - O(1)
- Sort each chunk in memory and write it to disk - O((n/max_mem) * max_mem * log(max_mem)) = O(n log(max_mem))
- Open all chunks as streams and merge them, repeatedly taking the minimum element - O(n log(n/max_mem)) using a minHeap, or O(n^2/max_mem) trivially (may be faster in practice)

Concerning computation, this is O(n * (log(max_mem) + log(n/max_mem))) = O(n log(n)).
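A minimal sketch of those three steps in Python, assuming integer input, line-based temporary files, and heapq.merge for the k-way step; the function name, chunk handling, and file format are my own illustrative choices:

```python
import heapq
import os
import tempfile

def external_sort(items, max_mem=100):
    """Sort an iterable that does not fit in memory, keeping at most max_mem
    elements in RAM at a time (plus one element per chunk during the merge)."""
    chunk_files = []
    chunk = []

    def flush_chunk():
        # Sort the current chunk in memory and write it to its own temp file,
        # one integer per line.
        chunk.sort()
        f = tempfile.NamedTemporaryFile("w", suffix=".chunk", delete=False)
        f.writelines(f"{x}\n" for x in chunk)
        f.close()
        chunk_files.append(f.name)
        chunk.clear()

    # Steps 1+2: split the input into max_mem-sized chunks, sort, write to disk.
    for item in items:
        chunk.append(item)
        if len(chunk) == max_mem:
            flush_chunk()
    if chunk:
        flush_chunk()

    # Step 3: open every chunk as a stream and k-way merge with a min-heap;
    # heapq.merge keeps only one element per chunk in memory at a time.
    streams = [(int(line) for line in open(name)) for name in chunk_files]
    try:
        yield from heapq.merge(*streams)
    finally:
        for name in chunk_files:
            os.remove(name)

# Usage, matching the question: 1000 elements, memory for 100.
# sorted_stream = external_sort(range(1000, 0, -1), max_mem=100)
# print(list(sorted_stream)[:5])   # -> [1, 2, 3, 4, 5]
```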
Concerning disk I/O, if all merging is done in one pass, this is only 2*n reads and 2*n writes. More generally, it's (1 + [depth of the merge tree]) * n.
All writes are sequential. The first read is sequential; the second is also sequential, but interleaved from the 10 chunk files.
If there were much more data, you'd need repeated or recursive merging (sort 100 elements per chunk, then repeatedly merge N chunks at a time). At that point it's worth replacing the split+sort step with Replacement/Selection, as described in @amit's answer, especially when the data is already almost sorted (you may be able to skip the merging step completely).
Note that a higher N may increase computation (very slightly, if you use the right structures), but it reduces the amount of disk I/O significantly (up to a point; if you merge too many chunks at once, you may run out of memory for the read buffers, causing unnecessary reads). Disk I/O is expensive, CPU cycles are not.
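For illustration, here is a rough sketch of the Replacement/Selection run generation mentioned above, assuming comparable, non-None items; the function name and the choice to yield runs as in-memory lists (rather than writing each run to disk) are mine:

```python
import heapq

def replacement_selection_runs(items, max_mem=100):
    """Yield sorted runs from `items` while never holding more than max_mem
    elements in memory. On random input the runs average about 2*max_mem
    elements; already-sorted input produces a single run."""
    it = iter(items)

    # Fill the working heap with the first max_mem elements.
    heap = []
    for x in it:
        heap.append(x)
        if len(heap) == max_mem:
            break
    heapq.heapify(heap)

    run = []      # current run being produced
    frozen = []   # elements too small for the current run; they seed the next one
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits into the current run
            else:
                frozen.append(nxt)          # has to wait for the next run
        if not heap:                        # current run is complete
            yield run
            run = []
            heap = frozen
            heapq.heapify(heap)
            frozen = []

# Usage:
# runs = list(replacement_selection_runs([5, 1, 9, 2, 8, 3, 7], max_mem=3))
# -> [[1, 2, 5, 8, 9], [3, 7]]
```

On already-sorted input this produces a single run, which is why the merge step can sometimes be skipped entirely.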
Upvotes: 5
Reputation: 178411
The standard way of doing it is an External Sort.
In an external sort it is not only important to have O(n log(n)) complexity - it is also critical to minimize the disk reads/writes as much as possible, and to make most reads and writes sequential (rather than random), since disk access is much more efficient when done sequentially.
The standard way of doing so is indeed a k-way merge sort, as suggested by @JanDvorak, but there are some faults in, and additions to, that suggestion which I am aiming to correct:
The standard formula for k, the "order" of the merge, is k = M/(2b), where M is the size of your memory and b is the size of each "buffer" (usually a disk block). Each merge pass reads b entries at a time from each "run" generated in the previous iteration, filling M/2 of the memory. The rest of the memory is used for "prediction" (which allows continuous work with minimal waiting for I/O), i.e. requesting further elements from a run ahead of time, and for the output buffer, in order to guarantee sequential writes in whole blocks.

The number of iterations with this approach is log_k(N/(2M)), where k is the merge order calculated above, M is the size of the memory, and N is the size of the file. Each iteration requires one sequential read and one sequential write of the entire file.

That said, the ratio file_size/memory_size is usually MUCH more than 10. If you are interested only in a ratio of 10, local optimizations might apply, but that is not the more common case, where file_size/memory_size >> 10.
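To make the formulas concrete, a small, purely illustrative calculation in Python - the record counts M, b, and N below are invented for the example and are not taken from the question:

```python
from math import ceil, log

# Hypothetical sizes, in records - these numbers are invented for illustration.
M = 1_000_000       # records that fit in memory
b = 1_000           # records per disk block ("buffer")
N = 1_000_000_000   # records in the file

k = M // (2 * b)                    # merge order k = M/(2b)                  -> 500
runs = N / (2 * M)                  # initial runs from Replacement/Selection -> 500
passes = ceil(log(runs) / log(k))   # log_k(N/(2M)) merge iterations          -> 1

print(k, runs, passes)              # 500 500.0 1
```

With those (hypothetical) sizes a single 500-way merge pass suffices after run generation, i.e. one sequential read and one sequential write of the whole file.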
Upvotes: 3
Reputation: 38745
Perhaps the interviewer expected you to ask: are those numbers the unique seven-digit telephone numbers mentioned by J. Bentley ("Cracking the Oyster")?
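If that is indeed the setup (each seven-digit number appears at most once), a bit-vector sort in the spirit of Bentley's solution sorts them in a single pass over the input; the sketch below, with its 10^7-bit array, is my own illustration of that idea, not code from the book:

```python
def bitmap_sort(numbers):
    """Sort unique seven-digit numbers (0..9,999,999) by setting one bit per
    number in a ~1.25 MB bit array, then scanning the bits in order."""
    bits = bytearray(10**7 // 8 + 1)
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)        # mark this number as present
    for n in range(10**7):
        if bits[n >> 3] & (1 << (n & 7)):   # emit every marked number, in order
            yield n

# Usage:
# print(list(bitmap_sort([5551212, 8675309, 4101234])))
# -> [4101234, 5551212, 8675309]
```

The bit array costs about 1.25 MB, so the trick only applies when the keys are known to be unique and drawn from a small, dense range.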
Upvotes: 3