lezebulon

Reputation: 7994

sorting huge file that is almost sorted

I'm facing the following problem:

I'm trying to create a duplicate of this file, except it would need to be sorted.

Is there a graceful way to do this?

The only way I can think of is the most generic way:

However, this doesn't take advantage of the fact that the data is "almost" sorted. Is there a better way to do this? For instance, one that avoids using external files on the HDD?

Upvotes: 1

Views: 302

Answers (2)

rcgldr

Reputation: 28826

Use a variation of bottom-up merge sort called natural merge sort. The idea here is to find runs of ordered data, then repeatedly merge those runs back and forth between two files (all sequential I/O) until only a single run is left.

If the sort doesn't have to be stable (preserve the order of equal elements), you can consider a run boundary to occur wherever a pair of sequential elements is out of order, which eliminates some housekeeping. If the sort needs to be stable, you have to keep track of the run boundaries on the initial pass that finds the runs; this could be an array of counts (the size of each run), which would hopefully fit in memory. After each merge pass the number of counts in the array is cut in half, and once only a single count remains, the sort is done.

Wiki article (no sample code given, though): natural bottom-up merge sort.
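As a rough illustration of the run detection and pairwise merging described above, here is a minimal in-memory sketch (function names are mine, not from the answer; a real external sort would stream the runs between two files instead of slicing lists):

```python
def merge(left, right):
    """Standard two-way merge of two sorted lists."""
    res, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            res.append(left[i]); i += 1
        else:
            res.append(right[j]); j += 1
    res.extend(left[i:])
    res.extend(right[j:])
    return res

def natural_merge_sort(a):
    """Natural merge sort: detect existing runs, merge them pairwise,
    repeat until a single run remains."""
    if not a:
        return a
    while True:
        # run boundaries: a new run starts wherever the order breaks
        runs = [0]
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:
                runs.append(i)
        runs.append(len(a))
        if len(runs) == 2:   # a single run -> already sorted
            return a
        # one merge pass: merge adjacent runs pairwise
        out = []
        for j in range(0, len(runs) - 1, 2):
            left = a[runs[j]:runs[j + 1]]
            right = a[runs[j + 1]:runs[j + 2]] if j + 2 < len(runs) else []
            out.extend(merge(left, right))
        a = out
```

On almost-sorted data the first pass finds only a few long runs, so very few merge passes are needed.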

If all the out-of-order elements consist of somewhat isolated records, you could separate them into a third file, copying only the in-order records from the first file to the second file. Then you sort the third file with any method you want (bottom-up merge sort is probably still best if the third file is large), then merge the second and third files to create a sorted file.
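A sketch of that split-then-merge idea, assuming one record per line compared as strings (that record format, and the function name, are my assumptions; here the "third file" of out-of-order records is small enough to keep in memory):

```python
import heapq
import os
import tempfile

def split_sort_merge(infile, outfile):
    """Copy in-order lines to a temp file, collect out-of-order lines,
    sort the latter, then merge the two sorted streams into outfile."""
    in_order = tempfile.NamedTemporaryFile("w+", delete=False)
    out_of_order = []
    last = None
    with open(infile) as f:
        for line in f:
            if last is None or line >= last:
                in_order.write(line)
                last = line
            else:
                out_of_order.append(line)
    in_order.seek(0)
    out_of_order.sort()
    # heapq.merge lazily merges the two already-sorted streams
    with open(outfile, "w") as out:
        for line in heapq.merge(in_order, out_of_order):
            out.write(line)
    in_order.close()
    os.unlink(in_order.name)
```

The final merge is sequential I/O over both inputs, so it stays cheap even for a huge main file.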

If you have multiple hard drives, keep the files on separate drives. If doing this on an SSD, it won't matter. If using a single hard drive, reading or writing a large number of records at a time (say 10 MB to 100 MB per read or write) will greatly reduce the seek overhead during the sort process.

Upvotes: 1

6502

Reputation: 114481

You could do something like this (example in Python):

# first pass: collect the out-of-sequence records
last = None
special = []
for r in records:
    if last is None or r >= last:   # >= keeps duplicates out of `special`
        last = r
    else:
        special.append(r)
        if len(special) > max_memory:
            break

if len(special) > max_memory:
    # too many out-of-sequence records, use a regular sort
    ...
else:
    special.sort()
    # second pass: merge `special` back into the in-order stream
    i = 0
    last = None
    for r in records:
        # skip the records already collected into `special`,
        # otherwise they would be written twice
        if last is None or r >= last:
            last = r
        else:
            continue
        # emit any pending out-of-order records that belong before r
        while i < len(special) and special[i] < r:
            write(special[i])
            i += 1
        write(r)
    # flush whatever is left in `special`
    while i < len(special):
        write(special[i])
        i += 1

Upvotes: 1
