lezebulon

Reputation: 7994

sorting huge file that is almost sorted

I'm facing the following problem:

I'm trying to create a duplicate of this file, except it would need to be sorted.

Is there a graceful way to do this?

The only way I can think of is the most generic way:

However, this doesn't take advantage of the fact that the data is "almost" sorted. Is there a better way to do this? For instance, one that avoids using external files on the HDD?

Upvotes: 1

Views: 302

Answers (2)

rcgldr

Reputation: 28826

Use a variation of bottom-up merge sort called natural merge sort. The idea here is to find runs of ordered data, then repeatedly merge those runs back and forth between two files (all sequential I/O) until only a single run is left.

If the sort doesn't have to be stable (preserve the order of equal elements), you can consider a run boundary to occur wherever a pair of sequential elements is out of order, which eliminates some housekeeping. If the sort needs to be stable, you have to keep track of the run boundaries on the initial pass that finds the runs; this could be an array of counts (the size of each run), which would hopefully fit in memory. After each merge pass the number of counts in the array is cut in half, and once only a single count remains, the sort is done.

Wiki article (no sample code given, though): natural bottom-up merge sort.
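As a rough illustration of the run detection and pairwise merging described above, here is a minimal in-memory sketch (function names are mine, not from the answer; a real external sort would stream the runs between two files instead of slicing lists):

```python
def merge(left, right):
    """Standard two-way merge of two sorted lists."""
    res, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            res.append(left[i]); i += 1
        else:
            res.append(right[j]); j += 1
    res.extend(left[i:])
    res.extend(right[j:])
    return res

def natural_merge_sort(a):
    """Natural merge sort: detect existing runs, merge them pairwise,
    repeat until a single run remains."""
    if not a:
        return a
    while True:
        # run boundaries: a new run starts wherever the order breaks
        runs = [0]
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:
                runs.append(i)
        runs.append(len(a))
        if len(runs) == 2:   # a single run -> already sorted
            return a
        # one merge pass: merge adjacent runs pairwise
        out = []
        for j in range(0, len(runs) - 1, 2):
            left = a[runs[j]:runs[j + 1]]
            right = a[runs[j + 1]:runs[j + 2]] if j + 2 < len(runs) else []
            out.extend(merge(left, right))
        a = out
```

On almost-sorted data the first pass finds only a few long runs, so very few merge passes are needed.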

If all the out-of-order elements consist of somewhat isolated records, you could separate them into a third file, copying only the in-order records from the first file to the second file. Then you sort the third file with any method you want (bottom-up merge sort is probably still best if the third file is large), then merge the second and third files to create a sorted file.
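A sketch of that split-then-merge idea, assuming one record per line compared as strings (that record format, and the function name, are my assumptions; here the "third file" of out-of-order records is small enough to keep in memory):

```python
import heapq
import os
import tempfile

def split_sort_merge(infile, outfile):
    """Copy in-order lines to a temp file, collect out-of-order lines,
    sort the latter, then merge the two sorted streams into outfile."""
    in_order = tempfile.NamedTemporaryFile("w+", delete=False)
    out_of_order = []
    last = None
    with open(infile) as f:
        for line in f:
            if last is None or line >= last:
                in_order.write(line)
                last = line
            else:
                out_of_order.append(line)
    in_order.seek(0)
    out_of_order.sort()
    # heapq.merge lazily merges the two already-sorted streams
    with open(outfile, "w") as out:
        for line in heapq.merge(in_order, out_of_order):
            out.write(line)
    in_order.close()
    os.unlink(in_order.name)
```

The final merge is sequential I/O over both inputs, so it stays cheap even for a huge main file.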

If you have multiple hard drives, keep the files on separate drives. If doing this on an SSD, it won't matter. If using a single hard drive, reading or writing a large number of records at a time (say 10 MB to 100 MB per read or write) will greatly reduce the seek overhead during the sort process.

Upvotes: 1

6502

Reputation: 114481

You could do something like this (example in Python):

# first pass: collect the out-of-sequence records
last = None
special = []
for r in records:
    if last is None or r >= last:   # >= keeps duplicates out of `special`
        last = r
    else:
        special.append(r)
        if len(special) > max_memory:
            break

if len(special) > max_memory:
    # too many out-of-sequence records, use a regular sort
    ...
else:
    special.sort()
    # second pass: merge `special` back into the in-order stream
    i = 0
    last = None
    for r in records:
        # skip the records already collected into `special`,
        # otherwise they would be written twice
        if last is None or r >= last:
            last = r
        else:
            continue
        # emit any pending out-of-order records that belong before r
        while i < len(special) and special[i] < r:
            write(special[i])
            i += 1
        write(r)
    # flush whatever is left in `special`
    while i < len(special):
        write(special[i])
        i += 1

Upvotes: 1
