dsimcha

Reputation: 68740

Efficient Out-Of-Core Sorting

I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.

Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in one pass, then technically I have an O(N^2) algorithm, though one that only has to perform O(N) writes to disk. If I iteratively merge them into C/2 files, then C/4 files, etc. then I have an O(N log N) algorithm, but one that has to perform O(N log N) writes to disk, and therefore has a huge constant term.
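
To put rough numbers on the disk traffic (illustrative only; the record and chunk counts below are made up):

```python
import math

# Back-of-the-envelope count of records written to disk by each strategy.
n_records = 100_000_000   # N, hypothetical
n_chunks = 256            # C sorted chunks that each fit in memory, hypothetical

# One C-way merge pass: every record is written out exactly once.
writes_one_pass = n_records

# Repeated pairwise merging (C -> C/2 -> ... -> 1) takes log2(C) passes,
# and every pass rewrites all N records.
writes_pairwise = n_records * math.ceil(math.log2(n_chunks))

print(writes_one_pass)   # 100000000
print(writes_pairwise)   # 800000000  (8 passes)
```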

What is the typical solution to this conundrum? Is there any good one?

Upvotes: 17

Views: 5027

Answers (7)

Jerry Coffin

Reputation: 490018

The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of runs you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you want to put it into a heap. Then as you write each record out, you read IN another record.

You then check whether that record would sort before or after the record you just wrote out. If it would sort after it, you insert it into your heap and continue. If it would sort before, you insert it into a second heap.

You stop adding records to the current run when the first heap is completely empty, and your second heap is taking up all your memory. At that point, you repeat the process, writing a new run to a new file.

This will usually produce considerably longer intermediate runs in the initial phase, so merging them is substantially less work. Assuming the input records are in random order, you can expect this to approximately double the length of each run--but if the input is even partially sorted, this can take advantage of that existing ordering to extend the run lengths even more.
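
Here is a rough Python sketch of this run-generation idea (assuming records are plain comparable values coming from an iterator, and that `memory_limit` counts records rather than bytes; a real implementation would stream each run to its own temporary file):

```python
import heapq

def replacement_selection_runs(records, memory_limit):
    """Yield sorted runs that are typically about 2 * memory_limit long."""
    it = iter(records)

    # Fill the heap with as many records as memory allows.
    current = []
    for rec in it:
        current.append(rec)
        if len(current) >= memory_limit:
            break
    heapq.heapify(current)

    while current:
        run = []      # in practice this would be written to a run file
        pending = []  # records that must wait for the next run
        while current:
            smallest = heapq.heappop(current)
            run.append(smallest)
            try:
                rec = next(it)
            except StopIteration:
                continue  # input exhausted: just drain the heap
            if rec >= smallest:
                heapq.heappush(current, rec)  # can still go out in this run
            else:
                pending.append(rec)           # would break the sorted order
        yield run
        current = pending
        heapq.heapify(current)
```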

As an aside, I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I don't have a copy of the third edition, so I don't have a page number for that.

Upvotes: 20

brianegge

Reputation: 29852

Are you sorting in place or creating a new copy? If you are sorting in place, then memory-mapped IO is usually a good option. Just map your entire file and perform a merge sort on it. The OS will keep as much of the file in memory as it can and, depending on the data set, will generally minimize your IO.

If you do write your own sorting algorithm, one trick is to reverse your direction after each pass. So, if on your first pass you go from beginning to end, go from end to beginning on your second pass. If you split your files into parts A, B, C, and D, then after sorting C and D you should merge C and D right away, and not go back to A and B. The reason, of course, is that your OS will page parts of the files into memory, and you want to use that cache as much as possible.
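
For the memory-mapped route, a tiny sketch (assuming fixed-width binary records that can be compared as raw 64-bit integers; variable-length records would need a different layout):

```python
import numpy as np

def sort_file_in_place(path):
    # Map the whole file as an array of 64-bit integers and sort in place.
    # The OS decides which pages stay resident, so the program never
    # manages chunks explicitly.
    arr = np.memmap(path, dtype=np.int64, mode="r+")
    arr.sort(kind="mergesort")
    arr.flush()  # push any dirty pages back to disk
```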

Upvotes: -1

Matthieu M.

Reputation: 299730

It's funny: I heard this same question not a month ago, and I remember the response our local guru gave as well.

"Use the unix sort command"

Though we admittedly thought it was a joke at the expense of the asker... it turns out that it was not. The reasoning is that those smart guys already put a lot of thought into how to solve the problem of very large files, and came up with a very impressive implementation that makes good use of the available resources.

Therefore, unless you plan on re-inventing the wheel (i.e. you have the time and this is business critical), simply using the unix sort is probably an excellent idea.

The only drawback is its arcane syntax. This page is dedicated to the command and various explanations.

My personal advice: take a small sample of the data and test that the command effectively does exactly what you want.
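
For example, something along these lines (the flags are GNU coreutils sort options; the paths and sizes are placeholders):

```python
import subprocess

# -S caps the in-memory buffer, -T points the spill files at a scratch
# disk, -o names the output file. --parallel is GNU-specific.
subprocess.run(
    ["sort", "-S", "2G", "-T", "/mnt/scratch", "--parallel=4",
     "-o", "sorted.txt", "big_input.txt"],
    check=True,
)
```

Running the same command first on a small sample (say, the first hundred thousand lines) is an easy way to check the key and field options before committing to the full file.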

Upvotes: 7

Nick Dandoulakis

Reputation: 43110

A good solution is external sorting. Specifically check out the external mergesort algorithm.

External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted does not fit into the main memory of a computing device (usually RAM) and instead must reside in slower external memory (usually a hard drive). The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles. The basic algorithm consists of two phases: the sorting phase and the merging phase. In the sorting phase, subfiles that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles. In the merging phase, the sorted subfiles are merged during one or more passes.
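
A compact Python sketch of the two phases (assuming one record per line in a text file; the chunk size is a placeholder):

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1_000_000):
    run_paths = []

    # Sorting phase: cut the input into sorted runs on disk.
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(chunk)
            run_paths.append(path)

    # Merging phase: a single k-way merge over all runs.
    run_files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*run_files))
    finally:
        for f in run_files:
            f.close()
        for p in run_paths:
            os.remove(p)
```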

Upvotes: 5

Keith Randall

Reputation: 23265

Nick is right, use external sorting. Your C-way merge doesn't imply O(N^2), by the way. Use a priority queue for the merge and it is still O(N lg N).
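
For example, a hand-rolled version of that merge (a sketch; it assumes the input streams are already sorted and never yield None):

```python
import heapq

def kway_merge(sorted_streams):
    """Merge C sorted streams. Each of the N output records costs one heap
    pop plus at most one push, so the merge is O(N log C), not O(N*C)."""
    iterators = [iter(s) for s in sorted_streams]
    heap = []
    for i, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)
    while heap:
        value, i = heapq.heappop(heap)
        yield value
        nxt = next(iterators[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

# list(kway_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))  ->  [1, 2, 3, 4, 5, 6, 7, 8]
```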

You might also look at cache oblivious algorithms for sorting.

Upvotes: 1

S.Lott

Reputation: 391820

Why aren't you using the algorithms in http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850 ?

They're quite good, and carefully explained.

Upvotes: 0

zildjohn01

Reputation: 11515

Why not look at the problem from a different perspective? For example, if you happen to be sorting names, make a pass sorting anything that begins with A-F, a second pass sorting strings that begin with G-M, etc. Then the results can simply be appended in order. The disadvantage is that the data must be read from disk C times.
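
A sketch of that multi-pass approach (assuming names one per line and that each letter range genuinely fits in memory; the ranges are just the example split):

```python
def sort_by_key_ranges(input_path, output_path, ranges):
    # One full scan of the input per key range; each bucket is sorted in
    # memory and appended to the output, so the result is globally sorted.
    with open(output_path, "w") as dst:
        for low, high in ranges:
            with open(input_path) as src:
                bucket = [line for line in src
                          if low <= line[:1].upper() <= high]
            bucket.sort()
            dst.writelines(bucket)

# Roughly the split described above:
# sort_by_key_ranges("names.txt", "sorted.txt",
#                    [("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")])
```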

Upvotes: 1
