bbvan

Reputation: 658

Why is sort (Linux command) slow when reading from a pipe?

I have a large text file of ~8 GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128 GB RAM. I have tried

Method 1

awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted

Method 2

awk '...' myBigFile > myBigFile.tmp
sort --parallel=56 myBigFile.tmp > myBigFile.sorted

Surprisingly, method 1 takes 11.5 min while method 2 takes less than 2 min in total (0.75 min for awk plus about 1 min for sort). Why is sorting so slow when piped? Is it not parallelized?

EDIT

awk and myBigFile are not important; the experiment is reproducible simply with seq 1 10000000 | sort --parallel=56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speed improvement with the un-piped version on my machine.
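The piped vs. un-piped comparison can be reproduced with something like the following sketch (timings vary by machine, and the effective parallelism is capped by available cores):

```shell
# Piped: sort reads from stdin, so it cannot seek or know the input size.
time sh -c 'seq 1 10000000 | sort --parallel=56 > /dev/null'

# Un-piped: sort sees the file size up front and can read it in parallel.
seq 1 10000000 > input.txt
time sort --parallel=56 input.txt > /dev/null
rm input.txt
```

Both commands produce identical sorted output; only the elapsed time differs.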

Upvotes: 8

Views: 12401

Answers (3)

JimD.

Reputation: 2403

When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
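A minimal way to try this, using generated data instead of the 8 GB file (with -S scaled down accordingly):

```shell
# With a large enough -S buffer, a piped sort can hold the whole input
# in memory and parallelize, much like the file case.
seq 1 10000000 | sort -S1G --parallel=4 > sorted.txt
wc -l sorted.txt
```

For the question's 8 GB file, the equivalent would be `awk '...' myBigFile | sort -S8G --parallel=56 > myBigFile.sorted`.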

This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek within the file (neither is possible with a pipe). Further, since you have so much memory relative to these file sizes, the data for myBigFile.tmp need not be written to disk before awk exits, and sort can read the file from cache rather than disk. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that in method 2 sort knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot parallelize reading the input because it cannot seek on a pipe.

Upvotes: 7

ceving

Reputation: 23876

You make more system calls if you use the pipe.

seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059

Without the pipe the file is mapped into memory.

seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33

Kernel calls are in most cases the bottleneck. That is why sendfile was invented.

Upvotes: 2

user184968

Reputation:

I think sort does not use threads when reading from a pipe.

  1. I have used this command for your first case. It shows that sort uses only 1 CPU even though it is told to use 4. atop also shows that there is only one thread in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4  > bf.txt"
    
  2. I have used this command for your second case. It shows that sort uses 2 CPUs. atop also shows that there are four threads in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
    

In your first scenario sort is an I/O-bound task: it makes lots of read syscalls from stdin. In your second scenario sort uses mmap syscalls to read the file and avoids being an I/O-bound task.

Below are results for the first and second scenarios:

$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4  > bf.txt"
        Command being timed: "bash -c seq 1 10000000 | sort --parallel 4  > bf.txt"
        User time (seconds): 35.85
        System time (seconds): 0.84
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 9320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2899
        Voluntary context switches: 1920
        Involuntary context switches: 1323
        Swaps: 0
        File system inputs: 0
        File system outputs: 459136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        User time (seconds): 43.03
        System time (seconds): 0.85
        Percent of CPU this job got: 175%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1018004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2445
        Voluntary context switches: 299
        Involuntary context switches: 4387
        Swaps: 0
        File system inputs: 0
        File system outputs: 308160
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
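One way to observe the thread-count difference directly (a sketch: nlwp is the number of kernel threads, and the sleep is a crude way to catch sort mid-run, so it may miss on a fast machine):

```shell
# Sort a file in the background and inspect the process's thread count.
seq 1 20000000 > tmp.bf.txt
sort --parallel=4 tmp.bf.txt > bf.txt &
sleep 1                # give sort time to spawn its worker threads
ps -o nlwp= -p $!      # > 1 when sort is running multithreaded
wait
```

With the piped variant (`seq ... | sort --parallel=4 &`), the same check tends to report a single thread, matching the atop observation above.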

Upvotes: 3
