dhenness

Reputation: 23

What can I do to speed up this bash script?

The code I have goes through a file and multiplies all the numbers in the first column by a number. The code works, but I think it's somewhat slow. It takes 26.676s (walltime) to go through a file with 2302 lines in it. I'm using a 2.7 GHz Intel Core i5 processor. Here is the code.

#!/bin/bash

i=2
sed -n 1p data.txt > data_diff.txt #outputs the header (x  y)
while [ $i -lt 2303 ]; do
    NUM=`sed -n "$i"p  data.txt | awk '{print $1}'`
    SEC=`sed -n "$i"p  data.txt | awk '{print $2}'`
    NNUM=$(bc <<< "$NUM*0.000123981")
    echo $NNUM $SEC >> data_diff.txt
    let i=$i+1
done

Upvotes: 0

Views: 564

Answers (3)

John Bollinger

Reputation: 180191

Your script runs 4603 separate sed processes, 4602 separate awk processes, and 2301 separate bc processes. If echo were not a built-in then it would also run 2301 echo processes. Starting a process has relatively large overhead. Not so large that you would ordinarily notice it, but you are running over 11000 short processes. The wall time consumption doesn't seem unreasonable for that.

Moreover, each sed that you run processes the whole input file anew, selecting from it just one line. This is horribly inefficient.
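To make the overhead concrete, here is a rough, illustrative comparison (my addition, not part of the answer; absolute numbers will vary by machine): timing 2301 single-line sed invocations against one sed that prints the same 2301 lines.

# Many short-lived sed processes, one per data line:
time for i in $(seq 2 2302); do sed -n "$i"p data.txt > /dev/null; done
# A single sed pass over the same lines:
time sed -n 2,2302p data.txt > /dev/null

The first version pays the process-startup cost 2301 times and rescans the file from the top on every iteration; the second pays it once.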

The solution is to reduce the number of processes you are running, and especially to perform only a single run through the whole input file. A fairly easy way to do that would be to convert to an awk script, possibly with a bash wrapper. That might look something like this:

#!/bin/bash

awk '
NR==1    { print; next }
NR>=2303 { exit }
         { print $1 * 0.000123981, $2 }
' data.txt > data_diff.txt

Note that the line beginning with NR>=2303 artificially stops processing the input file when it reaches the 2303rd line, as your original script does; you could omit that line of the script altogether to let it simply process all the lines, however many there are.

Note, too, that this uses awk's built-in floating-point arithmetic instead of running bc. If you actually need the arbitrary-precision arithmetic of bc then I'm sure you can figure out how to modify the script to get that.
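If you do need bc, one possible shape for that (a sketch of my own, not tested against your data): have a single awk emit one multiplication expression per data line, evaluate them all with a single bc, and use paste to reattach the second column.

#!/bin/bash
# Sketch: one awk builds the bc expressions, one bc evaluates them all,
# and paste glues the untouched second column back on.
sed -n 1p data.txt > data_diff.txt
awk 'NR > 1 { print $1 " * 0.000123981" }' data.txt |
    bc |
    paste -d' ' - <(awk 'NR > 1 { print $2 }' data.txt) >> data_diff.txt

That runs five processes in total regardless of how long the file is, instead of five per line.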

Upvotes: 4

chepner

Reputation: 531125

As an example of how to speed up the bash script (without implying that this is the right solution):

#!/bin/bash

{ IFS= read -r header 
  echo "$header"
  # You can drop the third name "rest" if your input file
  # only has two columns.
  while read -r num sec rest; do
      nnum=$( bc <<< "$num * 0.000123981" )
      echo "$nnum $sec"
  done
} < data.txt > data_diff.txt

Now you only have one extra call to bc per data line, necessitated by the fact that bash doesn't do floating-point arithmetic. The right answer is to use a single call to a program that can do floating-point arithmetic, as pointed out by David Z.
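As an aside (my sketch, not part of this answer): since 0.000123981 is exactly 123981/10^9, you could even drop the per-line bc and stay in pure bash with scaled integer arithmetic, assuming the first column holds non-negative integers.

# Sketch: the same read loop, but with bash integer arithmetic instead
# of bc. num * 0.000123981 == (num * 123981), split after nine
# zero-padded decimal digits.
while read -r num sec rest; do
    scaled=$(( num * 123981 ))
    printf '%d.%09d %s\n' $(( scaled / 1000000000 )) $(( scaled % 1000000000 )) "$sec"
done

This trades readability for speed, so it is only worth considering if the bc calls really dominate.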

Upvotes: 3

David Z

Reputation: 131570

Honestly, the biggest speedup you can get will come from using a single language that can do the whole task itself. This is mostly because your script invokes 5 extra processes for each line, and invoking extra processes is slow, but also because text processing in bash is really not that well optimized.

I'd recommend awk, given that you have it available:

awk '{ print $1*0.000123981, $2 }'

I'm sure you can improve this to skip the header line and print it without modification.
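In case it helps, here's the one-liner with that improvement folded in (essentially what John Bollinger's answer arrives at):

awk 'NR == 1 { print; next } { print $1 * 0.000123981, $2 }' data.txt > data_diff.txt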

You can also do this sort of thing with Perl, Python, C, Fortran, and many other languages, though it's unlikely to make much difference for such a simple calculation.

Upvotes: 5
