Olivier Delrieu

Reputation: 772

Linux: sorting a 500GB text file with 10^10 records

I have a 500GB text file with about 10 billion rows that need to be sorted in alphabetical order. What is the best algorithm to use? Can my implementation and set-up be improved?

For now, I am using the coreutils sort command:

LANG=C sort -k2,2 --field-separator=',' --buffer-size=80% --temporary-directory=/volatile BigFile

I am running this on an AWS EC2 virtual machine with 120 GB of RAM and 16 cores. It takes the better part of a day.

/volatile is a 10TB RAID0 array.

The 'LANG=C' trick delivers a 2x speed gain (thanks to [1]).

By default 'sort' uses 50% of the available RAM. Going up to 80-90% gives some improvement.

My understanding is that GNU 'sort' is a variant of the merge-sort algorithm with O(n log n) complexity, which is optimal for comparison-based sorting: see [2] & [3]. Would moving to quicksort help (I'm happy with an unstable sort)?

One thing I have noticed is that only 8 cores are used. This is related to default_max_threads being set to 8 in coreutils sort.c (see [4]). Would it help to recompile sort.c with 16?
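One possibly simpler avenue (untested on my side): recent GNU coreutils versions document a --parallel flag that overrides this default without recompiling. The rest of the invocation is just my command from above:

# --parallel raises the thread cap without patching sort.c
# (available in GNU coreutils 8.6 and later).
LANG=C sort -k2,2 --field-separator=',' --buffer-size=80% \
    --temporary-directory=/volatile --parallel=16 BigFile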

Thanks!


FOLLOW-UP:

@dariusz

I used Chris's suggestions and yours below.

As the data was already generated in batches, I sorted each batch separately (on several separate machines) and then combined them with 'sort --merge'. Works like a charm and is much faster: each record now costs roughly log(N/K) comparisons instead of log N, plus a single merge pass. A sketch of the pipeline is below.
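A minimal sketch of the two-phase approach (the batch file names are illustrative, not my exact setup):

# Phase 1: sort each pre-generated batch independently
# (these runs can be spread across several machines).
for f in batch_*.csv; do
    LANG=C sort -k2,2 -t',' --buffer-size=80% -o "$f.sorted" "$f"
done

# Phase 2: 'sort --merge' combines already-sorted inputs in a single pass,
# without re-sorting anything.
LANG=C sort --merge -k2,2 -t',' -o BigFile.sorted batch_*.csv.sorted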

I also rethought the project from scratch: some data post-processing is now performed while the data is generated, so that unneeded data (noise) can be discarded before sorting takes place.

Altogether, the data size reduction and the sort/merge approach led to a massive reduction in the computing resources needed to achieve my objective.

Thanks for all your helpful comments.

Upvotes: 12

Views: 1698

Answers (3)

olegarch

Reputation: 3891

I think you need to perform this sort in 2 stages:

  1. Split the input into trie-like buckets that fit into memory.
  2. Iterate over the buckets in alphabetical order: fetch each bucket, sort it, and append it to the output file.

Here is an example.

Imagine your bucket limit is only 2 lines, and your input file is:

infile: 0000 0001 0002 0003 5 53 52 7000

On the 1st iteration, you read your input file (the "super-bucket", with an empty prefix) and split it according to the 1st letter.

There would be 3 output files:

0: 000 001 002 003

5: (empty) 3 2

7: 000

As you can see, the bucket with filename/prefix 7 contains only one record, 000, which is "7000" split into 7 (the filename) and 000 (the tail of the string). Since this is just one record, we do not need to split this file any further. But the files "0" and "5" contain 4 and 3 records, which is more than the limit of 2, so we need to split them again. After the split:

00: 00 01 02 03

5: (empty)

52: (empty)

53: (empty)

7: 000

As you can see, the files with prefixes "5" and "7" are already fully split, so we just need to split file "00".

After splitting, you will have a set of relatively small files. Then run the 2nd stage:

Sort the filenames and process them in that order: sort the contents of each file and append the result to the output, prepending the filename (the prefix) to each output string. A rough sketch of both stages is below.
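A rough shell sketch of the idea (the bucket naming is illustrative, and this shows only a single-level split; a real implementation would re-split any bucket that is still over the limit, as in the example above):

# Stage 1: route each record into a bucket file named after its first
# character, storing only the tail of the record inside the bucket.
awk '{ print substr($0, 2) > ("bucket." substr($0, 1, 1)) }' infile

# Stage 2: visit the buckets in alphabetical order, sort each one in memory,
# and re-attach the bucket's prefix while appending to the output.
for b in bucket.*; do
    prefix=${b#bucket.}
    LANG=C sort "$b" | sed "s/^/$prefix/" >> outfile
done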

Upvotes: 1

Dariusz

Reputation: 22241

Just an idea:

I assume the file contents are generated over quite a long period of time. Write an application (a script?) which would periodically move the data generated so far to a different location, append its contents to another file, perform a sort on that file, and repeat until all the data is gathered.

That way your system would spend more total time sorting, but the results would be available sooner, since sorting partially sorted data is faster than sorting fully unsorted data. A sketch of the loop is below.
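A minimal sketch of the loop, assuming the generator drops finished chunks into a directory (all paths and names here are made up for illustration):

# Start with an empty running result.
: > sorted.all

# As each generated chunk appears, sort it, then merge it into the running
# result; 'sort -m' only merges, so the cost stays close to linear per chunk.
for chunk in generated/chunk_*; do
    LANG=C sort -k2,2 -t',' -o "$chunk.sorted" "$chunk"
    LANG=C sort -m -k2,2 -t',' -o sorted.all.tmp sorted.all "$chunk.sorted"
    mv sorted.all.tmp sorted.all
done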

Upvotes: 1

MobA11y

Reputation: 18860

The benefit of quicksort over mergesort is that it has no additional memory overhead. The benefit of mergesort is its guaranteed O(n log n) run time, whereas quicksort can be much worse in the event of poor pivot sampling. If you have no reason to be concerned about memory use, don't change. If you do, just make sure you pick a quicksort implementation that does solid pivot sampling.

I don't think recompiling sort.c would help spectacularly; it might, on a micro-optimization scale. But your bottleneck here is going to be memory/disk speed, not the number of processors available. My intuition is that 8 threads will already be maxing out your I/O throughput, so you would see no performance improvement, but this certainly depends on your specific setup.

Also, you can gain significant performance by taking advantage of the distribution of your data. For example, evenly distributed data can be sorted very quickly with a single bucket-sort pass, followed by a mergesort of each bucket. This also has the added benefit of decreasing the total memory overhead of mergesort: if the memory complexity of mergesort is O(N) and you can separate your data into K buckets, your new memory overhead is O(N/K). A sketch is below.
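A hedged sketch of the bucket pass, using the question's comma-separated layout and bucketing on the first byte of the sort field (file names are illustrative). Because the buckets partition the key space in order, their sorted outputs can simply be concatenated; no merge step is needed:

# Pass 1: route each record into a bucket keyed by the first character of
# the sort field (field 2 of the CSV).
awk -F',' '{ print > ("part." substr($2, 1, 1)) }' BigFile

# Pass 2: sort each bucket independently -- each needs only ~1/K of the
# memory -- and concatenate in key order (the glob expands alphabetically).
for p in part.*; do
    LANG=C sort -k2,2 -t',' "$p"
done > BigFile.sorted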

Upvotes: 5
