Reputation: 658
I have a large text file of ~8GB which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with SSD and 128GB RAM. I have tried
Method 1
awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted
Method 2
awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted
Surprisingly, method 1 takes 11.5 min while method 2 takes only (0.75 + 1 < 2) min. Why is sorting so much slower when reading from a pipe? Is it not parallelized?
EDIT
The awk filter and myBigFile are not important; the experiment is repeatable simply with seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov). I also observed a six-fold speed improvement with the un-piped version on my machine.
Upvotes: 8
Views: 12401
Reputation: 2403
When reading from a pipe, sort assumes that the input is small, and for small inputs parallelism isn't helpful. To get sort to use parallelism you need to tell it to allocate a large main-memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on a system like yours with 128GB of main memory, method 2 may still be faster.
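A minimal sketch of that fix, using seq as a stand-in for the real data (the stream length and buffer size here are small placeholders; for the ~8GB file you would use -S8G):

```shell
# Give the piped sort an explicit large buffer with -S so it no longer
# assumes the input is small; tune the size to your data (-S8G for ~8GB).
seq 1 1000000 | sort -n -S256M --parallel=4 > sorted.txt

# sort -c exits non-zero if the file is out of order.
sort -n -c sorted.txt && echo "sorted OK"
```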
This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek within the file (neither of which is possible with a pipe). Further, since you have so much memory compared to these file sizes, the data in myBigFile.tmp need not be written to disk before awk exits, and sort can read the file from cache rather than from disk. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, though I haven't checked the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot parallelize reading the input because it cannot seek on a pipe.
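The effect is easy to reproduce without the original data; a hedged sketch using seq as the data source (file names are illustrative, and actual timings vary by machine):

```shell
# Method 1: sort reads from a pipe, so it cannot stat or seek the input
# and has to discover its size as it reads.
seq 1 1000000 | sort -n --parallel=4 > piped.sorted

# Method 2: sort is given a regular file, so it can see the size up front
# and divide the work among threads.
seq 1 1000000 > input.tmp
sort -n --parallel=4 input.tmp > file.sorted

# The output is identical either way; only the elapsed time differs.
cmp piped.sorted file.sorted && echo "identical output"
```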
Upvotes: 7
Reputation: 23876
You make more system calls if you use the pipe.
seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059
Without the pipe, the file is mapped into memory.
seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
Kernel calls are in most cases the bottleneck. That is the reason why sendfile was invented.
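For a per-syscall summary rather than a raw count, strace -c can tabulate the calls for both invocations (hedged sketch; strace needs ptrace permission, so the tracing is treated as best-effort and skipped where unavailable):

```shell
seq 1000000 > input

# strace -c writes a table of syscall counts to the -o file; compare the
# read/mmap mix for the piped vs. file-backed invocations.
if command -v strace >/dev/null 2>&1; then
    { seq 1000000 | strace -c -o piped.summary sort --parallel=4 > /dev/null &&
      strace -c -o file.summary sort --parallel=4 input > /dev/null &&
      grep -E 'read|mmap' piped.summary file.summary
    } || echo "strace could not attach in this environment"
fi
```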
Upvotes: 2
Reputation:
I think sort does not use threads when reading from a pipe.
I used this command for your first case. It shows that sort uses only one CPU even though it is told to use four, and atop also shows that there is only one thread in sort:
/usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4 > bf.txt"
I used this command for your second case. It shows that sort uses two CPUs, and atop also shows that there are four threads in sort:
/usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
In your first scenario sort is an I/O-bound task: it makes lots of read syscalls from stdin. In your second scenario sort uses mmap syscalls to read the file and avoids being I/O bound.
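One way to observe the thread difference without atop is to sample sort's thread count with ps while it runs (a Linux-specific, timing-sensitive sketch; nlwp is the number of lightweight processes, i.e. threads):

```shell
# File-backed case: start sort in the background and sample its thread count.
seq 1 5000000 > tmp.bf.txt
sort --parallel=4 tmp.bf.txt > bf.txt &
pid=$!
sleep 0.2
# nlwp is often > 1 here, while the piped variant stays at 1; the sample
# may miss if sort finishes first, so ignore a failed lookup.
ps -o nlwp= -p "$pid" || true
wait "$pid"
```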
Below are results for the first and second scenarios:
$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4 > bf.txt"
Command being timed: "bash -c seq 1 10000000 | sort --parallel 4 > bf.txt"
User time (seconds): 35.85
System time (seconds): 0.84
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2899
Voluntary context switches: 1920
Involuntary context switches: 1323
Swaps: 0
File system inputs: 0
File system outputs: 459136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
User time (seconds): 43.03
System time (seconds): 0.85
Percent of CPU this job got: 175%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1018004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2445
Voluntary context switches: 299
Involuntary context switches: 4387
Swaps: 0
File system inputs: 0
File system outputs: 308160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Upvotes: 3