Reputation: 3742
We are sorting a 5GB file with 37 fields, using 5 of them as sort keys. The big file is composed of 1000 files of 5MB each.
After 190 minutes it still hasn't finished.
I am wondering if there are other methods to speed up the sorting. We chose Unix sort because we don't want it to use up all the memory, so any purely in-memory approach is not acceptable.
What is the advantage of sorting each file independently and then using the -m option to merge-sort them?
Upvotes: 36
Views: 21605
Reputation: 33725
parsort
is part of GNU Parallel. It chops files into smaller parts, sorts these in parallel, and then merges the results.
It uses GNU Parallel and GNU sort, but outperforms sort --parallel by 3x on a 64-core machine.
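A minimal hedged sketch of how this might be invoked for the question's data (the file name and key options below are placeholders; per its documentation, parsort accepts the same options as sort):
# placeholders: sort a comma-separated file on field 2 numerically, then field 1
parsort -t',' -k2,2n -k1,1 bigfile.csv > bigfile.sorted.csv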
Upvotes: 2
Reputation: 129
Another optimization:
LC_ALL=C sort ...
And if, for example, you have a CSV file with quoted numbers and need to sort numerically on col2, then col1, use:
-t',' -k2.2,2n -k1.2,1n
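Spelled out as a full command (the file name is a placeholder; the .2 in each key starts the comparison at the second character of the field, skipping the opening quote):
LC_ALL=C sort -t',' -k2.2,2n -k1.2,1n data.csv -o data.sorted.csv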
And to pre-sort the files before merging (sort -m) them into a single output file, use:
LC_ALL=C ls -1 *.csv |
xargs -d'\n' -n1 -P4 -I'{}' sort ... -o '{}' '{}'
LC_ALL=C sort -m --parallel=20 --buffer-size=30% ... -o sort.out *.csv
Upvotes: 4
Reputation: 2488
Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer, do:
sort -S 50% file
Note that modern Unix sort can sort in parallel. My experience is that it automatically uses as many cores as possible. You can set it directly using --parallel. To sort using 4 threads:
sort --parallel=4 file
So all in all, you should put everything into one file and execute something like:
sort -S 50% --parallel=4 file
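Applied to the question's 1000 input chunks, a rough sketch (file names are placeholders, and the real five -k key options would still have to be added):
cat chunk_* > big_file                                # put everything into one file
sort -S 50% --parallel=4 big_file -o big_file.sorted  # 50% memory buffer, 4 threads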
Upvotes: 64
Reputation: 3388
Split the file into smaller files, sort the smaller files using many CPUs, and merge them back together.
I've done this in the past:
split -l5000000 data.tsv '_tmp'
for FILE in _tmp*; do sort "$FILE" -o "$FILE" & done  # sort each chunk in the background
wait                                                  # let every background sort finish
sort -m _tmp* -o data.tsv.sorted                      # merge the sorted chunks
It worked well for me.
Example performance on a 20M-line file:
joshua10.13> wc randn20M.csv
20000000 20000000 163197726 randn20M.csv
joshua10.14> cat par_srt.sh
#!/bin/bash
split -l5000000 randn20M.csv '_tmp';
ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done;
sort -m _tmp* -o data.tsv.sorted
joshua10.15> time ./par_srt.sh
1.516u 0.344s 0:05.85 31.6% 0+0k 0+522584io 0pf+0w
joshua10.16> time sort randn20M.csv -o dataone.sorted
21.461u 0.596s 0:24.08 91.5% 0+0k 0+318752io 0pf+0w
Remark: if you are I/O bound (e.g. a 20 GB file with only 20 lines), then this won't help at all.
Upvotes: 3
Reputation: 310980
The Unix sort isn't the fastest sort out there by any manner of means. It uses a strange implementation that can easily be outrun over data sets that are large enough to require multiple merge passes, as yours clearly does. I would have a look around for a replacement. You could even consider loading the file into a database: you might get better performance that way, and you would certainly have the data in a more convenient form afterwards.
For completeness, the main issue is the bucket sort itself. It's quick for small data sets, although not as quick as Quicksort, but it produces twice as many runs as replacement selection would. Once you get into multilevel merging, the number of runs and hence the number of merge passes totally dominates the CPU-bound distribution phase.
I implemented a sort-merge package for COBOL many years ago, straight out of Knuth vol. III, with distribution via replacement selection, and balanced merging with dummy runs. On large enough data sets it easily outperformed the Unix sort, with an increasing gradient as N increased, and 'large enough' wasn't all that large given disk sizes of those days.
Upvotes: 5
Reputation: 754550
One of the main time consumers with Unix sort is finding the keys; that is anything but the simple comparison operation that you tend to see in simple sorting exercises. Even finding one of the keys is quite a slow process.
So, one way to speed things up is to make it easier for the sort to find the keys, by preprocessing the file so that the 5 keys you mention are at the front of each line, then sorting the data (maybe using the splitting and merging operations suggested by others) and then removing the keys.
For example, if you have colon-separated fields, and the sort keys are 1, 3, 7, 10, 12, and they're all regular alphabetic sorts, then you might use:
awk -F: '{printf "%s:%s:%s:%s:%s:%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t: -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 |
sed 's/^[^:]*:[^:]*:[^:]*:[^:]*:[^:]*://'
You might even be able to do without the five -k options and simply run sort -t:. In fact, you can probably arrange to use a different delimiter altogether (maybe a control character such as ^A) to simplify the code. You separate the key fields from the main record with this alternative character:
awk -F: '{printf "%s:%s:%s:%s:%s^A%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed 's/^[^^A]*^A//'
This uses a bash-ism (ANSI-C quoting) in the $'\001' argument to sort; the ^A items in the awk and sed scripts are what you get from typing Control-A, though you could also arrange for the bash notation to provide the character too:
awk -F: '{printf "%s:%s:%s:%s:%s'$'\001''%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed "s/^[^$'\001']*$'\001'//"
(Warning: untested scripting.)
There's a fascinating article on re-engineering the Unix sort ('Theory and Practice in the Construction of a Working Sort Routine', J. P. Linderman, AT&T Bell Labs Tech Journal, Oct 1984) that's not readily available (I've not found it on the Internet despite several attempts to search for it), which describes how /bin/sort was improved. Even after all the improvements, one of its recommendations for complex sorts was exactly along these lines.
Upvotes: 5
Reputation: 72707
sort -m files
...
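For the question's layout of 1000 already-split chunks, a hedged sketch (file names are placeholders):
for f in chunk_*; do sort "$f" -o "$f"; done   # sort each 5MB chunk in place first
sort -m chunk_* -o merged.sorted               # merge the pre-sorted chunks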
(-m is POSIX and should be supported by all sorts of sorts; pun intended). Sorting each file first consumes far fewer resources.
Upvotes: 6