Reputation: 157
I ran into a strange problem. I have a large file (maybe more than 1,000,000,000 lines) which contains only a single column representing file sizes. It looks like:
55568
9700
7243
9692
63
5508
1679
14072
.....
I want to count the occurrences of each value, so I used two different scripts.
NOTE: the file used below is truncated and only contains 10,000 lines!
bob@bob-ruby:~$ cat 1.sh
#!/bin/bash
while read size ; do
set -- $size
((count[$1]++))
done < file-size.txt
bob@bob-ruby:~$
bob@bob-ruby:~$ cat 2.sh
#!/bin/bash
awk '{count[$1]++}' file-size.txt
bob@bob-ruby:~$
I found that 1.sh (the pure shell script) is much slower than 2.sh (the awk script):
bob@bob-ruby:~$ time bash 2.sh
real 0m0.045s
user 0m0.012s
sys 0m0.032s
bob@bob-ruby:~$ time bash 1.sh
real 0m0.618s
user 0m0.508s
sys 0m0.112s
bob@bob-ruby:~$
Through the strace command, I found that 1.sh generates lots of syscalls, while 2.sh makes far fewer. Why is that?
Does awk do some 'magic' work inside?
bob@bob-ruby:~$ strace -c bash 1.sh
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
38.62 0.042011 1 30320 rt_sigprocmask
29.97 0.032597 2 20212 _llseek
15.33 0.016674 2 10115 read
12.57 0.013675 1 10106 10106 ioctl
(cut)
bob@bob-ruby:~$ strace -c bash 2.sh
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
95.52 0.008000 4000 2 1 waitpid
3.20 0.000268 21 13 5 access
1.28 0.000107 5 21 fstat64
0.00 0.000000 0 9 read
Upvotes: 4
Views: 3078
Reputation: 59563
The biggest difference is that the while loop version is required to read the file one line at a time, while awk reads the entire input file and parses it in memory. You are lucky that read is a builtin, or the loop would be considerably less efficient. The usual case for shell scripts is that each while loop iteration spawns multiple child processes to process a line. Those can be considerably slower - consider parsing a line into fields using the following:
while read line
do
    field1=`echo $line | cut -f 1 -d '|'`
    field2=`echo $line | cut -f 2 -d '|'`
    ...
done
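The subprocess cost can be avoided entirely by letting read split the fields itself. A minimal sketch under the same assumptions (pipe-delimited input; data.txt is a placeholder name):

# No child processes: IFS tells read where to split each line.
while IFS='|' read -r field1 field2 rest
do
    echo "first=$field1 second=$field2"
done < data.txt

Each field assignment happens inside the shell, so the loop runs without forking an echo/cut pipeline per field.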
I inherited a shell script that processed database output this way. My manager was amazed when I turned a multi-hour batch process into about 20 minutes with a simple chunk of awk.
Edit
I dug into the awk source code because I was curious about this one. It looks like this is straightforward use of standard I/O buffering hidden behind a call to getc. The C standard library implements efficient buffering on the input stream. I ran dtruss using the following very simple shell script:
#!/bin/zsh
while read line
do
    echo "$line"
done < blah.c
The input, blah.c, is a 191349-byte C file containing 7219 lines.
The dtruss output contained 4266 calls to read with a buffer size of 1 byte for the shell script. It looks like zsh is not buffering its input at all. I did the same test using bash and it contained precisely the same sequence of read calls. Another important note is that zsh generated 6074 system calls and bash generated 6604 system calls.
The equivalent awk '{print}' blah.c command showed 56 calls to read_nocancel with a buffer size of 4096. It had a total of 160 system calls.
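On Linux, a comparable experiment is easy to run with strace (the dtruss numbers above are from macOS, so exact syscall names and counts will differ):

# Count read(2) calls made by the shell loop vs. awk on the same file.
strace -c -e trace=read bash -c 'while read line; do :; done < blah.c'
strace -c -e trace=read awk '{print}' blah.c > /dev/null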
The easiest way to think about this is that awk is a program that parses text for a living, while shells are concerned with process management, plumbing together pipelines, and generally running programs interactively for a user. Use the appropriate tool for the job at hand. If you are processing data from large files, steer clear of general-purpose shell commands - that is not what the shell is meant to do, and it will not do it very efficiently. On the other hand, if you are writing scripts that run shell utilities back to back, you wouldn't want to write them in Perl or Python, because handling subprocess exit statuses and pipelining between them would be painful.
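Incidentally, note that 2.sh builds the count array but never prints it. A complete version of the awk approach would add an END block, along these lines:

# Tally each size, then print value/count pairs once input is exhausted.
awk '{count[$1]++} END {for (v in count) print v, count[v]}' file-size.txt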
Upvotes: 3
Reputation: 157
The answer from Chet Ramey ([email protected]):
On 12/21/12 9:59 PM, boblin wrote:
hi, chet:
I ran into a strange problem. I have a large file (maybe more than
10,000 lines) which contains only a single column representing file sizes. It looks like:
55568
9700
7243
9692
63
5508
1679
14072
.....
I want to count the occurrences of each value. I used two different methods:
bob@bob-ruby:~$ cat 1.sh
#!/bin/bash
while read size ; do
set -- $size
((count[$1]++))
done < file-size.txt
bob@bob-ruby:~$
This is really an inefficient way to do it, but not so much that it should make a huge difference. There's no need to use `set' just for cosmetic reasons. You can do
while read size; do
    ((count[$size]++))
done < file-size.txt
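The tallies can then be printed by expanding the array's indices with "${!count[@]}" - a small illustrative sketch, not part of the original email:

# Print each distinct size and how many times it occurred.
for size in "${!count[@]}"; do
    echo "$size ${count[$size]}"
done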
bob@bob-ruby:~$ cat 2.sh
#!/bin/bash
awk '{count[$1]++}' file-size.txt
bob@bob-ruby:~$
I found that 1.sh (the pure shell script) is much slower than 2.sh (the awk script):
bob@bob-ruby:~$ time bash 2.sh
real 0m0.045s
user 0m0.012s
sys 0m0.032s
bob@bob-ruby:~$ time bash 1.sh
real 0m0.618s
user 0m0.508s
sys 0m0.112s
bob@bob-ruby:~$
Through the strace command, I found that 1.sh generated lots of syscalls, while 2.sh made far fewer. Why is that?
Because you didn't trace awk. You traced bash invoking and waiting for awk. That's why `waitpid' dominated the execution time.
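To trace awk itself, strace can follow the child process with -f, or awk can be traced directly:

strace -c -f bash 2.sh                        # -f follows forked children
strace -c awk '{count[$1]++}' file-size.txt   # or trace awk by itself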
Does awk do any 'magic' work inside?
awk has many fewer restrictions on its operation, as explained below.
bob@bob-ruby:~$ strace -c bash 1.sh
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
38.62 0.042011 1 30320 rt_sigprocmask
29.97 0.032597 2 20212 _llseek
15.33 0.016674 2 10115 read
12.57 0.013675 1 10106 10106 ioctl
There is an issue with bash invoking sigprocmask a lot, because it invokes setjmp in such a way that setjmp saves and restores the signal mask. I did some work on signals and traps that will allow the next version to avoid restoring the signal mask.
The lseeks and reads have to stay. I imagine awk can read as much data as it wants into an internal buffer and process it from memory. The shell is required to reset the file offset back to what it has consumed after each read, so programs it invokes can get the intended standard input -- it is not allowed to read ahead between read builtin invocations. That means the shell has to test the file descriptor it's reading from for the ability to seek each time the read builtin is run -- terminals and pipes can't seek backward in the data stream, so the shell has to read one character at a time from those. The shell's read builtin does some minimal buffering, so even for regular files where the shell can seek backwards, it has to call lseek to adjust the file pointer before the read builtin can return a line. That also increases the number of read(2) calls required: in some cases, the shell reads the same data from the file more than once, and at least one read(2) call is required per invocation of the read builtin.
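That guarantee is observable: after the read builtin consumes a line, the file offset is left exactly past it, so a child process sharing the descriptor sees the rest of the input. A minimal demonstration:

# 'read' consumes line 1; 'cat', a separate process on the same fd,
# prints everything from line 2 onward.
{ read first; cat; } < file-size.txt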
The ioctl is to tell whether or not the input fd is attached to a terminal; in addition to unbuffered reads, several options are only available when using a terminal. There is at least one lseek per call to the read builtin, to tell whether the input fd is a pipe.
That accounts for the system calls you listed in your strace output.
"The lyf so short, the craft so long to lerne." - Chaucer
"Ars longa, vita brevis" - Hippocrates
Chet Ramey, ITS, CWRU [email protected] http://cnswww.cns.cwru.edu/~chet/
Upvotes: 4