lamwaiman1988

Reputation: 3742

How do we sort faster using unix sort?

We are sorting a 5GB file with 37 fields on 5 keys. The big file is composed of 1000 files of 5MB each.

After 190 minutes it still hasn't finished.

I am wondering if there are other methods to speed up the sorting. We chose Unix sort because we don't want it to use up all the memory, so any memory-based approach is not an option.

What is the advantage of sorting each file independently and then using the -m option to merge them?

Upvotes: 36

Views: 21605

Answers (7)

Ole Tange

Reputation: 33725

parsort is part of GNU Parallel. It chops files into smaller parts, sorts these in parallel, and merges the results.

It uses GNU Parallel and GNU sort, but outperforms sort --parallel by 3x on a 64-core machine.
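
As a rough sketch for the question's file (the key positions are placeholders and the fields are assumed to be whitespace-separated; parsort passes the usual sort key options through to GNU sort):

# Hypothetical key fields; parsort writes the sorted result to stdout.
LC_ALL=C parsort -k1,1 -k3,3 -k7,7 -k10,10 -k12,12 bigfile > bigfile.sorted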

Upvotes: 2

druid62

Reputation: 129

Another optimization: force the C locale, so that sort compares raw bytes instead of applying locale-aware collation, which is considerably slower:

LC_ALL=C sort ...

And if, for example, you have a CSV file with quoted numbers and need to sort numerically on column 2, then column 1, use the following (the .2 character offsets skip each field's opening quote):

-t',' -k2.2,2n -k1.2,1n
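
A full command along those lines might look like this sketch, with hypothetical file names and the C locale trick from above folded in:

# The C locale avoids slow collation; the .2 offsets skip the opening quotes.
LC_ALL=C sort -t',' -k2.2,2n -k1.2,1n data.csv -o data.sorted.csv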

And to pre-sort the files in place before merging them (sort -m) into a single output file, use:

ls -1 *.csv |
  xargs -d'\n' -n1 -P4 -I'{}' env LC_ALL=C sort ... -o '{}' '{}'

LC_ALL=C sort -m --parallel=20 --buffer-size=30% ... -o sort.out *.csv

Upvotes: 4

Malcolm

Reputation: 2488

Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer do:

sort -S 50% file

Note that modern Unix sort can sort in parallel. In my experience it automatically uses as many cores as possible. You can set the number of threads directly with --parallel. To sort using 4 threads:

sort --parallel=4 file

So all in all, you should put everything into one file and execute something like:

sort -S 50% --parallel=4 file
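
For the question's 5-key, 37-field file that might look like the sketch below; the tab delimiter and key positions are placeholders, and LC_ALL=C from the other answers is folded in for an extra speedup:

# Hypothetical delimiter and key fields; adjust -t and the -k options to the real layout.
LC_ALL=C sort -S 50% --parallel=4 -t $'\t' \
    -k1,1 -k3,3 -k7,7 -k10,10 -k12,12 \
    bigfile -o bigfile.sorted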

Upvotes: 64

dranxo

Reputation: 3388

Split the file into smaller files, sort the smaller files using many CPUs, and merge them back together.

I've done this in the past:

split -l5000000 data.tsv '_tmp'
for FILE in _tmp*; do sort "$FILE" -o "$FILE" & done
wait                                   # make sure every background sort has finished
sort -m _tmp* -o data.tsv.sorted

It worked well for me.

Example performance on a 20-million-line file:

joshua10.13> wc randn20M.csv 
 20000000  20000000 163197726 randn20M.csv
joshua10.14> cat par_srt.sh 
#!/bin/bash

split -l5000000 randn20M.csv '_tmp';
ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done;
sort -m _tmp* -o data.tsv.sorted
joshua10.15> time ./par_srt.sh 
1.516u 0.344s 0:05.85 31.6%     0+0k 0+522584io 0pf+0w

joshua10.16> time sort randn20M.csv -o dataone.sorted
21.461u 0.596s 0:24.08 91.5%    0+0k 0+318752io 0pf+0w

Remark: if you are I/O bound (e.g. a 20 GB file with only 20 lines), then this won't help at all.

Upvotes: 3

user207421

Reputation: 310980

The Unix sort isn't the fastest sort out there by any manner of means. It uses a strange implementation that can easily be outrun over data sets that are large enough to require multiple merge passes, as yours clearly does. I would have a look around for a replacement. You could even consider loading the file into a database: you might get better performance that way, and you would certainly have the data in a more convenient form afterwards.

For completeness, the main issue is the bucket sort itself. It's quick for small data sets, although not as quick as Quicksort, but it produces twice as many runs as replacement selection would. Once you get into multilevel merging, the number of runs and hence the number of merge passes totally dominates the CPU-bound distribution phase.

I implemented a sort-merge package for COBOL many years ago, straight out of Knuth vol. III, with distribution via replacement selection, and balanced merging with dummy runs. On large enough data sets it easily outperformed the Unix sort, with an increasing gradient as N increased, and 'large enough' wasn't all that large given disk sizes of those days.

Upvotes: 5

Jonathan Leffler

Reputation: 754550

One of the main time consumers with Unix sort is finding the keys; that is anything but the simple comparison operation that you tend to see in simple sorting exercises. Even finding one of the keys is quite a slow process.

So, one way to speed things up is to make it easier for the sort to find the keys, by preprocessing the file so that the 5 keys you mention are at the front of each line, then sorting the data (maybe using the splitting and merging operations suggested by others) and then removing the keys.

For example, if you have colon-separated fields, and the sort keys are 1, 3, 7, 10, 12, and they're all regular alphabetic sorts, then you might use:

awk -F: '{printf "%s:%s:%s:%s:%s:%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t: -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 |
sed 's/^[^:]*:[^:]*:[^:]*:[^:]*:[^:]*://'

You might even be able to do without the five -k options and simply run sort -t:. In fact, you can probably arrange to use a different delimiter altogether (maybe a control character such as ^A) to simplify the code. You separate the key fields from the main record with this alternative character:

awk -F: '{printf "%s:%s:%s:%s:%s^A%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed 's/^[^^A]*^A//'

This uses a bash-ism (ANSI-C Quoting) in the $'\001' argument to sort; the ^A items in the awk and sed scripts are what you get from typing Control-A, though you could also arrange for the bash notation to provide the character too:

awk -F: '{printf "%s:%s:%s:%s:%s'$'\001''%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed 's/^[^'$'\001'']*'$'\001''//'

(Warning: untested scripting.)

There's a fascinating article on re-engineering the Unix sort ('Theory and Practice in the Construction of a Working Sort Routine', J P Linderman, AT&T Bell Labs Tech Journal, Oct 1984) that's not readily available (I've not found it on the Internet despite several attempts to search for it), that describes how /bin/sort was improved. Even after all the improvements, one of its recommendations for complex sorts was exactly along these lines.

Upvotes: 5

Jens

Reputation: 72707

  1. Divide and conquer. A sort of N files can be faster if you first sort each of the N files (and use different CPUs on multiprocessors). Then the files only need to be merged (e.g. sort -m files ...; -m is POSIX and should be supported by all sorts of sorts; pun intended). Sorting each file consumes far fewer resources.
  2. Give sort a fast temporary directory (with GNU sort, e.g. via -T; see the sketch after this list)
  3. Thinking outside the box: make the process that creates the files sort the data right away
  4. Brute force: throw more hardware (memory, CPU cycles) at the problem :-)
  5. Get informed about the concept of external sorting
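
A minimal sketch of points 1 and 2 combined, assuming GNU sort; the chunk file names and the scratch directory are placeholders:

# Sort each chunk in the background on its own CPU, keeping sort's temp files
# on a fast disk (/fast/scratch is a hypothetical path; -T is a GNU/BSD option).
for f in chunk_*.txt; do
    sort -T /fast/scratch "$f" -o "$f" &
done
wait                                   # let every background sort finish

# Merge the pre-sorted chunks; -m only merges, so it is far cheaper than a full sort.
sort -m -T /fast/scratch chunk_*.txt -o merged.sorted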

Upvotes: 6
