Reputation: 169
I'm updating an old script that parses ARP data and extracts useful information from it. We added a new router, and while I can pull the ARP data out of it, the data is in a new format. I have a file, "zTempMonth", which contains all the ARP data from both sets of routers, and I need to compile it down into a new, normalized format. The code below does what I need logically, but it's extremely slow: it will take days to run these loops, where the script previously took 20-30 minutes. Is there a way to speed this up, or to identify what's slowing it down?
Thank you in advance,
echo "Parsing zTempMonth"
while read LINE
do
wc=`echo $LINE | wc -w`
if [[ $wc -eq "6" ]]; then
true
out=$(echo $LINE | awk '{ print $2 " " $4 " " $6}')
echo $out >> zTempMonth.tmp
else
false
fi
if [[ $wc -eq "4" ]]; then
true
out=$(echo $LINE | awk '{ print $1 " " $3 " " $4}')
echo $out >> zTempMonth.tmp
else
false
fi
done < zTempMonth
Upvotes: 4
Views: 11479
Reputation: 19016
When writing shell scripts, it’s almost always better to call a function directly rather than using a subshell to call the function. The usual convention that I’ve seen is to echo the return value of the function and capture that output using a subshell.
For example:
#!/bin/bash
function get_path() {
echo "/path/to/something"
}
mypath="$(get_path)"
This works fine, but there is a significant speed overhead to using a subshell and there is a much faster alternative. Instead, you can just have a convention wherein a particular variable is always the return value of the function (I use retval). This has the added benefit of also allowing you to return arrays from your functions.
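As a minimal sketch of the array case, assuming bash (not plain sh): the helper name split_words and the sample line are made up for illustration, and only the retval convention comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: "returns" an array by assigning to the global retval.
split_words() {
    # read -a splits the string on whitespace into an array, without a subshell.
    read -r -a retval <<< "$1"
}

split_words "10.0.0.1 aa:bb:cc:dd:ee:ff Vlan10"
fields=("${retval[@]}")   # copy before another call overwrites retval
echo "${#fields[@]} fields, MAC is ${fields[1]}"
```

Copying retval into a local name right after the call is a design choice: any later function that follows the same convention will overwrite it.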
If you don’t know what a subshell is, for the purposes of this blog post, a subshell is another bash shell that is spawned whenever you use $() or `` and is used to execute the code you put inside.
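One way to see that $() really runs the code in a separate shell is to watch a variable change get lost. This is only an illustrative sketch, and the names counter and bump are invented for it.

```shell
#!/usr/bin/env bash
counter=0

# Hypothetical function: increments the global counter and prints it.
bump() {
    counter=$((counter + 1))
    echo "$counter"
}

seen=$(bump)             # runs in a subshell: the increment is lost afterwards
after_subshell=$counter

bump > /dev/null         # direct call in the current shell: the increment sticks
after_direct=$counter

echo "subshell saw $seen, parent has $after_subshell, after direct call $after_direct"
```

The subshell works on a copy of the parent's state, which is why its assignment never reaches the parent and why spawning it costs extra time on every call.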
I did some simple testing to allow you to observe the overhead. For two functionally equivalent scripts:
This one uses a subshell:
#!/bin/bash
function foo() {
# Return value
echo hello
}
for (( i = 0; i < 10000; i++ )); do
result="$(foo)"
echo $result
done
This one uses a variable:
#!/bin/bash
# Initialize
retval=""
function foo() {
# Return value
retval="hello"
}
for (( i = 0; i < 10000; i++ )); do
foo
echo $retval
done
The speed difference between these two is noticeable and significant.
$ for i in variable subshell; do
> echo -e "\n$i"
> time ./$i > /dev/null
> done
variable
real 0m0.367s
user 0m0.346s
sys 0m0.015s
subshell
real 0m11.937s
user 0m3.121s
sys 0m0.359s
(variable and subshell are executable scripts)
As you can see, when using variable, execution takes 0.367 seconds. subshell, however, takes a full 11.937 seconds!
Source: http://rus.har.mn/blog/2010-07-05/subshells/
Finally, you can rewrite your script as follows:
echo "Parsing zTempMonth"
while read LINE ; do
# Save the word count to a temporary file
echo $LINE | wc -w > zTempMonth-x.tmp
# Read file line by line
wc=''
while read line; do
wc="$wc
$line"
done < zTempMonth-x.tmp
if [[ $wc -eq "6" ]]; then
true
echo $LINE | awk '{ print $2 " " $4 " " $6}' >> zTempMonth.tmp
else
false
fi
if [[ $wc -eq "4" ]]; then
true
echo $LINE | awk '{ print $1 " " $3 " " $4}' >> zTempMonth.tmp
else
false
fi
done < zTempMonth
Upvotes: 8
Reputation: 77197
>> (open(f, 'a')) calls in a loop are slow.
You could speed this up and remain in pure bash, just by losing #2 and #3:
#!/usr/bin/env bash
while read -a line; do
case "${#line[@]}" in
6) printf '%s %s %s\n' "${line[1]}" "${line[3]}" "${line[5]}";;
4) printf '%s %s %s\n' "${line[0]}" "${line[2]}" "${line[3]}";;
esac
done < zTempMonth >> zTempMonth.tmp
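The single redirection after done in the loop above is what avoids reopening the output file for every line. A small sketch of the two placements, using made-up sample data and scratch files from mktemp (both are assumptions, not part of the original script), shows that the output is the same either way:

```shell
#!/usr/bin/env bash
# Hypothetical scratch files, created only for this demonstration.
per_line=$(mktemp)
once=$(mktemp)
sample=$'a b c d\ne f g h\ni j k l'

# Appends inside the loop: the file is reopened on every iteration.
while read -r first _; do
    printf '%s\n' "$first" >> "$per_line"
done <<< "$sample"

# Appends once for the whole loop: the file is opened a single time.
while read -r first _; do
    printf '%s\n' "$first"
done <<< "$sample" >> "$once"

if cmp -s "$per_line" "$once"; then same=yes; else same=no; fi
echo "identical output: $same"
rm -f "$per_line" "$once"
```

With three lines the difference is not measurable; it only starts to matter when the loop runs many thousands of times.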
But if there are more than a few lines, this will still be slower than pure awk. Consider an awk script as simple as this:
BEGIN {
print "Parsing zTempMonth"
}
NF == 6 {
print $2 " " $4 " " $6
}
NF == 4 {
print $1 " " $3 " " $4
}
You could execute it like this:
awk -f thatAwkScript zTempMonth >> zTempMonth.tmp
to get the same append approach as your current script.
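For a quick check before running it on the real file, the same two rules can be tried inline on a few sample lines. The lines below are invented for illustration and may not match the actual zTempMonth layout; the sketch also leaves out the BEGIN banner and uses print with commas, which joins fields with a single space by default.

```shell
#!/usr/bin/env bash
# Made-up sample lines; the real router output may look different.
sample=$'Internet 10.0.0.1 5 aabb.ccdd.eeff ARPA Vlan10\n10.0.0.2 ether 00:11:22:33:44:55 eth0\nsome header line'

normalized=$(printf '%s\n' "$sample" | awk '
    NF == 6 { print $2, $4, $6 }   # six-field format
    NF == 4 { print $1, $3, $4 }   # four-field format
')
printf '%s\n' "$normalized"
```

Lines with any other field count, such as the three-word header here, match neither rule and are silently dropped.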
Upvotes: 11