ailms

Reputation: 157

Performance issue: 'while + read' vs. AWK

I ran into a strange problem. I have a large file (maybe more than 1,000,000,000 lines) which contains only a single column representing file sizes. It looks like:

55568
9700
7243
9692
63
5508
1679
14072
.....

I want to count the occurrences of each value. I used two different scripts:

NOTE: the file used below is truncated and contains only 10,000 lines!

bob@bob-ruby:~$ cat 1.sh
#!/bin/bash

while read size ; do
    set -- $size
    ((count[$1]++))
done < file-size.txt
bob@bob-ruby:~$


bob@bob-ruby:~$ cat 2.sh
#!/bin/bash

awk '{count[$1]++}' file-size.txt
bob@bob-ruby:~$
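Note: as written, neither script prints anything - each just builds its count array, which is enough for timing. For reference, a complete awk version that also emits the tallies would add an END block, e.g.:

awk '{count[$1]++} END {for (v in count) print v, count[v]}' file-size.txt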

and I found that 1.sh (pure shell script) is much slower than 2.sh (awk script):

bob@bob-ruby:~$ time bash 2.sh

real    0m0.045s
user    0m0.012s
sys     0m0.032s
bob@bob-ruby:~$ time bash 1.sh

real    0m0.618s
user    0m0.508s
sys     0m0.112s
bob@bob-ruby:~$

Through the strace command, I found that 1.sh generated lots of syscalls, while 2.sh generated far fewer. Why is that?

Does awk do some 'magic' work inside?

bob@bob-ruby:~$ strace -c bash 1.sh
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.62    0.042011           1     30320           rt_sigprocmask
 29.97    0.032597           2     20212           _llseek
 15.33    0.016674           2     10115           read
 12.57    0.013675           1     10106     10106 ioctl

(cut)

bob@bob-ruby:~$ strace -c bash 2.sh
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.52    0.008000        4000         2         1 waitpid
  3.20    0.000268          21        13         5 access
  1.28    0.000107           5        21           fstat64
  0.00    0.000000           0         9           read

Upvotes: 4

Views: 3078

Answers (2)

D.Shawley

Reputation: 59563

The biggest difference is that the while loop version is required to read the file one line at a time, while awk reads the entire input and parses it in memory. You are lucky that read is a builtin, or it would be considerably less efficient. The usual case for shell scripts is that each while loop iteration spawns multiple child processes to process a line. These can be considerably slower - consider parsing a line into fields using the following:

while read line
do
  field1=`echo $line | cut -f 1 -d '|'`
  field2=`echo $line | cut -f 2 -d '|'`
  ...
done
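For contrast, here is a sketch of the same parsing done by the read builtin itself. Setting IFS makes read do the splitting, so no echo or cut child processes are spawned per line (data.txt and the field names are placeholders):

while IFS='|' read -r field1 field2 rest
do
  # read itself split the line on '|' - no subprocesses per line
  printf '%s %s\n' "$field1" "$field2"
done < data.txt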

I inherited a shell script that processed database output this way. My manager was amazed when I turned a multi-hour batch process into about 20 minutes with a simple chunk of awk.

Edit
I dug into the awk source code because I was curious about this one. It turns out to be a straightforward use of standard I/O buffering hidden behind a call to getc - the C standard library implements efficient buffering on the input stream. I ran dtruss on the following very simple shell script:

#!/bin/zsh
while
    read line
do
    echo "$line"
done < blah.c

The input, blah.c, is a 191349 byte C file containing 7219 lines.

The dtruss output contained 4266 calls to read with a buffer size of 1 byte for the shell script - it looks like zsh is not buffering its input at all. I ran the same test under bash, and the trace contained precisely the same sequence of read calls. Another important note: zsh generated 6074 system calls in total and bash generated 6604.

The equivalent awk '{print}' blah.c command showed 56 calls to read_nocancel with a buffer size of 4096. It had a total of 160 system calls.
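The same comparison is easy to reproduce on Linux with strace: -c summarizes call counts, -e trace=read restricts the trace to read(2), and -f follows forked children.

# read(2) summary for the shell loop (note the call count)
strace -f -c -e trace=read bash 1.sh

# awk traced directly reads in large buffered chunks
strace -c -e trace=read awk '{count[$1]++}' file-size.txt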


The easiest way to think about this is that awk parses text for a living, while shells are concerned with process management, plumbing together pipelines, and generally running programs interactively for a user. Use the appropriate tool for the job at hand. If you are processing data from large files, steer clear of general-purpose shell commands - that is not what the shell is meant to do, and it will not do it very efficiently. Conversely, if you are writing scripts that execute shell utilities back to back, you wouldn't want to write them in Perl or Python, because handling exit statuses of subprocesses and pipelining between them would be painful.

Upvotes: 3

ailms

Reputation: 157

The answer from Chet Ramey ([email protected]):

On 12/21/12 9:59 PM, boblin wrote:

hi, chet:

I ran into a strange problem. I have a large file (maybe more than 10,000 lines) which contains only a single column representing file sizes. It looks like:

55568
9700
7243
9692
63
5508
1679
14072
.....

I want to count the occurrences of each value. I used two different methods:

bob@bob-ruby:~$ cat 1.sh
#!/bin/bash

while read size ; do
    set -- $size
    ((count[$1]++))
done < file-size.txt
bob@bob-ruby:~$

This is really an inefficient way to do it, but not so much that it should make a huge difference. There's no need to use `set' just for cosmetic reasons. You can do

while read size; do ((count[$size]++)); done < file-size.txt
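To actually print the tallies afterwards, the array indices can be walked - a sketch:

for size in "${!count[@]}"; do echo "$size ${count[$size]}"; done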

bob@bob-ruby:~$ cat 2.sh
#!/bin/bash

awk '{count[$1]++}' file-size.txt
bob@bob-ruby:~$

and I found that 1.sh (pure shell script) is much slower than 2.sh (awk script):

bob@bob-ruby:~$ time bash 2.sh

real    0m0.045s
user    0m0.012s
sys     0m0.032s
bob@bob-ruby:~$ time bash 1.sh

real    0m0.618s
user    0m0.508s
sys     0m0.112s
bob@bob-ruby:~$

Through the strace command, I found that 1.sh generated lots of syscalls, while 2.sh generated far fewer. Why is that?

Because you didn't trace awk. You traced bash invoking and waiting for awk. That's why `waitpid' dominated the execution time.
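In other words, to see awk's own system calls you would either trace it directly or tell strace to follow children - for example:

strace -c awk '{count[$1]++}' file-size.txt   # trace awk itself
strace -c -f bash 2.sh                        # or follow the forked awk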

Does awk do any 'magic' work inside?

awk has many fewer restrictions on its operation, as explained below.

bob@bob-ruby:~$ strace -c bash 1.sh
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.62    0.042011           1     30320           rt_sigprocmask
 29.97    0.032597           2     20212           _llseek
 15.33    0.016674           2     10115           read
 12.57    0.013675           1     10106     10106 ioctl

There is an issue with bash invoking sigprocmask a lot, because it invokes setjmp in such a way that setjmp saves and restores the signal mask. I did some work on signals and traps that will allow the next version to avoid restoring the signal mask.

The lseeks and reads have to stay. I imagine awk can read as much data as it wants into an internal buffer and process it from memory. The shell is required to reset the file offset back to what it has consumed after each read, so programs it invokes can get the intended standard input -- it is not allowed to read ahead between read builtin invocations.

That means the shell has to test the file descriptor it's reading from for the ability to seek each time the read builtin is run -- terminals and pipes can't seek backward in the data stream, so the shell has to read one character at a time from those.

The shell's read builtin does some minimal buffering, so even for regular files where the shell can seek backwards, it has to call lseek to adjust the file pointer before the read builtin can return a line. That also increases the number of read(2) calls required: in some cases, the shell reads the same data from the file more than once, and at least one read(2) call is required per invocation of the read builtin.
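That contract is easy to observe (demo.txt here is just a throwaway example file): because read seeks back to just past the line it consumed, a program run afterwards on the same descriptor sees exactly the remainder of the file.

printf 'one\ntwo\nthree\n' > demo.txt
# read consumes the first line; cat then sees only "two" and "three"
{ read first; cat; } < demo.txt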

The ioctl is to tell whether or not the input fd is attached to a terminal; in addition to unbuffered reads, several options are only available when using a terminal. There is at least one lseek per call to the read builtin, to tell whether the input fd is a pipe.
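Both checks can be observed directly by restricting a trace to those calls - a sketch:

strace -c -e trace=lseek,ioctl bash -c 'while read l; do :; done < file-size.txt'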

That accounts for the system calls you listed in your strace output.

Chet

``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates

Chet Ramey, ITS, CWRU    [email protected]    http://cnswww.cns.cwru.edu/~chet/

Upvotes: 4
