Anshu Kandhari

Reputation: 379

Interview puzzle: Sorting a million number input with limited memory

I tried answering this using external sort, but the interviewer replied that the complexity was too high: n * n log(n), i.e. n^2 log(n). Is there a better alternative?

To simplify the question: let's suppose we have 1000 elements to sort, with space allocated for only 100 elements. What is the best algorithm that will take less time than the external sort?

Upvotes: 1

Views: 5055

Answers (3)

John Dvorak

Reputation: 27277

I don't know which external sort you (or the interviewer) meant, but my suggestion is a 10-way (in your case) merge:

  • Split the file into chunks of MAX_MEM size (100 elements)
    • this is O(1)
  • Sort each chunk in memory and store as a separate file
    • this is O((n/max_mem) * max_mem * log(max_mem)) = O(n log(max_mem))
  • Open all chunks as streams of elements
  • Merge all streams by selecting the lowest element at each step.
    • this is O(n log(n/max_mem)) using a minHeap, or O(n^2/max_mem) trivially (which may be faster in practice); see the sketch after this list
  • Delete the chunks
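
A minimal sketch of this split-sort-merge pipeline, assuming plain text files with one integer per line (MAX_MEM and the chunk file names are illustrative, not part of the answer):

```python
import heapq
import os

MAX_MEM = 100  # illustrative: number of elements that fit in memory at once

def external_sort(input_path, output_path):
    # Split + sort: read MAX_MEM elements at a time, sort in memory,
    # and write each sorted chunk to its own file.
    chunk_paths = []
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(MAX_MEM), f)]
            if not chunk:
                break
            chunk.sort()
            path = f"chunk_{len(chunk_paths)}.tmp"
            with open(path, "w") as out:
                out.writelines(f"{x}\n" for x in chunk)
            chunk_paths.append(path)

    # Merge: heapq.merge keeps one head element per chunk in a min-heap
    # and repeatedly emits the lowest -- the O(n log(n/max_mem)) merge above.
    streams = [open(p) for p in chunk_paths]
    try:
        with open(output_path, "w") as out:
            merged = heapq.merge(*((int(line) for line in s) for s in streams))
            out.writelines(f"{x}\n" for x in merged)
    finally:
        for s in streams:
            s.close()
        for p in chunk_paths:
            os.remove(p)  # delete the chunks
```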

Concerning computation, this is O(n (log(max_mem)+log(n/max_mem)))=O(n log(n))

Concerning disk I/O, if all merging is done in one pass, this is 2*n reads and 2*n writes only. More generally, it's (1+[depth of the merge tree])*n

All writes are sequential. The first read is sequential; the second is sequential within each chunk, interleaved across the 10 files.

If there were much more data, you'd need a repeated or recursive merge (sort 100 elements per chunk, then repeatedly merge N chunks at a time). At that point it's worth replacing the split+sort step with Replacement/Selection as described in @amit's answer, especially when the data is already almost sorted (in which case you may avoid the merging step completely).

Note that a higher N may increase computation (very slightly, if you use the right structures), but significantly reduces the amount of disk I/O (up to a point: if you merge too many chunks at once, you may run out of memory for the read buffers, causing unnecessary reads). Disk I/O is expensive; CPU cycles are not.

Upvotes: 5

amit

Reputation: 178411

The standard way of doing it is an External Sort.

In an external sort it is not only important to have O(n log n) complexity; it is also critical to minimize disk reads/writes as much as possible, and to make most reads and writes sequential (rather than random), since disk access is much more efficient when done sequentially.

The standard way of doing so is indeed a k-way merge sort, as suggested by @JanDvorak, but there are some faults in, and additions to, that suggestion which I aim to correct:

  1. First, doing RS (Replacement-Selection) on the input decreases the number of initial "runs" (increasing sequences), and thus usually decreases the total number of iterations needed by the subsequent merge sort (a sketch follows this list).
  2. We need memory for buffering (reading and writing input); thus, for memory size M and file size 10*M, we cannot do a 10-way merge: it would result in a LOT of disk reads (reading element by element, rather than in blocks).
    The standard formula for k, the "order" of the merge, is k = M/(2b), where M is the size of your memory and b is the size of each "buffer" (usually a disk block). With M = 100 elements and blocks of b = 10 elements, for example, k = 100/(2*10) = 5.
  3. Each merge step is done by reading b entries from each "run" generated in the previous iteration, filling M/2 of the memory. The rest of the memory is used for "prediction" (which allows continuous work with minimal waiting for I/O), i.e. requesting more elements from a run ahead of time, and for the output buffer, in order to guarantee sequential writes in blocks.
  4. The total number of iterations with this approach is log_k(N/(2M)), where k is the order of the merge (calculated above), M is the size of the memory, and N is the size of the file (RS produces roughly N/(2M) initial runs). Each iteration requires one sequential read and one sequential write of the entire file.
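
A minimal sketch of the Replacement-Selection pass from point 1, under simplifying assumptions (the heap capacity M is the full working memory, and runs are yielded as lists; a real implementation would stream each run to its own file):

```python
import heapq

def replacement_selection(elements, M):
    # Yield sorted runs; on random input the average run length is about 2M,
    # which is why RS roughly halves the number of initial runs vs. chunking.
    it = iter(elements)
    heap = [x for _, x in zip(range(M), it)]  # working heap: up to M elements
    heapq.heapify(heap)

    run, frozen = [], []  # frozen: inputs too small to join the current run
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)                # emit to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits the increasing run
            else:
                frozen.append(nxt)          # must wait for the next run
        if not heap:                        # current run is exhausted
            yield run
            run, heap, frozen = [], frozen, []
            heapq.heapify(heap)
```

On input that is already (almost) sorted, every element re-enters the heap, so a single run can cover the whole file and the merge phase shrinks or disappears, which is the effect @JanDvorak's answer mentions.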

That said, the ratio file_size/memory_size is usually MUCH more than 10. If you are interested only in a ratio of 10, local optimizations might apply, but that is not the more common case, where file_size/memory_size >> 10.

Upvotes: 3

Ekkehard.Horner

Reputation: 38745

Perhaps the interviewer expected you to ask: Are those numbers the unique seven digit telephone numbers mentioned by J. Bentley (Cracking the Oyster)?
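
The reference is to Jon Bentley's Programming Pearls, Column 1 ("Cracking the Oyster"): if the inputs are distinct seven-digit numbers, there is no need to sort at all. A bit vector of 10^7 bits (about 1.2 MB) records which numbers are present, and scanning it in order emits them sorted. A minimal sketch, assuming unique non-negative integers below 10**7:

```python
def bitmap_sort(numbers, universe=10**7):
    # One bit per possible value: 10**7 bits is roughly 1.2 MB.
    bits = bytearray(universe // 8 + 1)
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)        # mark n as present
    for n in range(universe):               # scan in order: set bits emerge sorted
        if bits[n >> 3] & (1 << (n & 7)):
            yield n
```

This is one sequential pass over the input plus one scan of the bitmap, with a fixed memory footprint independent of n.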

Upvotes: 3
