Frank

Reputation: 7595

Optimizing performance of an IO-heavy shell script

The following bash code, which reads a single tab-separated input file line by line and writes to a large number (~100) of output files, exhibits unreasonable performance: around 30 seconds for 10,000 lines, when I want it to be usable on millions or billions of lines of input.

In the following code, batches is an already-defined associative array (in other languages, a map).
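(For illustration only: such an array might have been built with something like the line below; the keys and batch numbers here are invented, not from the real data.)

declare -A batches=( ["id_a"]=1 ["id_b"]=2 ["id_c"]=3 )    # hypothetical contents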

How can this be improved?

while IFS='' read -r line
do
    x=`echo "$line" | cut -d$'\t' -f1`;
    y=`echo "$line" | cut -d$'\t' -f2`;
#   echo "find match between $x and $y";
    a="${batches["$x"]}";
    b="${batches["$y"]}";
    if [ -z $a ] && [ -n $b ]
        then echo "$line" >> Output/batch_$b.txt;
    elif [ -n $a ] && [ -z $b ]
        then echo "$line" >> Output/batch_$a.txt;
    elif [ -z $a ] && [ -z $b ]
            then echo "$line" >> Output/batch_0.txt;
    elif [ $a -gt $b ]
        then echo "$line" >> Output/batch_$a.txt;
    elif [ $a -le $b ]
            then echo "$line" >> Output/batch_$b.txt;
    fi

done < input.txt

Upvotes: 1

Views: 343

Answers (1)

Charles Duffy

Reputation: 296049

while IFS= read -r line; do
   x=${line%%$'\t'*}; rest=${line#*$'\t'}
   y=${rest%%$'\t'*}; rest=${rest#*$'\t'}
   ...
done <input.txt

That way you aren't starting two external programs every single time you want to split line into x and y.

Under normal circumstances you could use read to do the string-splitting implicitly, by reading the columns into separate fields. However, read trims leading whitespace, so that doesn't work correctly when (as here) the columns are whitespace-separated and the first one can be empty; hence the parameter expansions above. See BashFAQ #73 for details on how parameter expansion works, or BashFAQ #100 for a general introduction to string manipulation with bash's native facilities.
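As a quick illustration of that pitfall (the sample line below is invented), compare what read gives you with what the parameter expansions give you for a line whose first column is empty:

line=$'\tsecond\tthird'                  # made-up sample: empty first column

IFS=$'\t' read -r x y rest <<<"$line"
echo "read:      x=[$x] y=[$y]"          # read:      x=[second] y=[third]   -- wrong

x=${line%%$'\t'*}; rest=${line#*$'\t'}
y=${rest%%$'\t'*}
echo "expansion: x=[$x] y=[$y]"          # expansion: x=[] y=[second]        -- right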


Also, re-opening your output files every time you want to write a single line to them is silly at this kind of volume. Either use awk, which will handle this for you automatically, or write a helper (note that the following requires a fairly new release of bash -- probably 4.2):

write_to_file() {
    local filename content new_out_fd
    filename=$1; shift

    # Join the remaining arguments with tabs to form the line to write.
    printf -v content '%s\t' "$@"
    content=${content%$'\t'}              # strip the trailing tab

    # Global cache of open file descriptors, keyed by filename.
    declare -g -A output_fds
    if ! [[ ${output_fds[$filename]} ]]; then
      # First write to this file: open it once and remember the FD.
      exec {new_out_fd}>"$filename"
      output_fds[$filename]=$new_out_fd
    fi
    printf '%s\n' "$content" >&"${output_fds[$filename]}"
}

...and then:

if [[ $a && ! $b ]]; then
    write_to_file "Output/batch_$a.txt" "$line"
elif [[ ! $a ]] && [[ $b ]]; then
    write_to_file "Output/batch_$b.txt" "$line"
elif [[ ! $a ]] && [[ ! $b ]]; then
    write_to_file "Output/batch_0.txt" "$line"
elif (( a > b )); then
    write_to_file "Output/batch_$a.txt" "$line"
else
    write_to_file "Output/batch_$b.txt" "$line"
fi

Note that caching FDs only makes sense if you have few enough output files that you can keep a file descriptor open for each of them (and when avoiding the repeated re-opening of files that receive more than one write is a net benefit). Feel free to leave this part out and adopt only the faster string-splitting if it doesn't make sense for your case.
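For reference, the awk route mentioned above could look roughly like the sketch below. Note the assumptions: it expects the batches map to be available as a tab-separated file (called batches.txt here, one key and batch number per line), which is not something given in the question, so adapt as needed. Like the helper above, awk keeps each output file open after the first write to it, and it is subject to the same per-process open-file limit (ulimit -n shows it).

awk -F'\t' '
    NR == FNR { batch[$1] = $2; next }            # first file: load the key -> batch map
    {
        a = batch[$1]; b = batch[$2]
        if      (a == "" && b != "") out = b
        else if (a != "" && b == "") out = a
        else if (a == "" && b == "") out = 0
        else if (a + 0 > b + 0)      out = a
        else                         out = b
        print > ("Output/batch_" out ".txt")      # file stays open after the first write
    }
' batches.txt input.txt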


Just for completeness, here's another approach (also written using automatic fd management, thus requiring bash 4.2): running two cut invocations and letting each of them read through the entire input file.

exec {x_columns_fd}< <(cut -d$'\t' -f1 <input.txt)
exec {y_columns_fd}< <(cut -d$'\t' -f2 <input.txt)
while IFS='' read -r line && \
      IFS='' read -r -u "$x_columns_fd" x && \
      IFS='' read -r -u "$y_columns_fd" y; do
  ...
done <input.txt

This works because it's not cut itself that's inefficient -- it's starting it up, running it, reading its output and shutting it down all the time that costs. If you just run two copies of cut, and let each of them process the whole file, performance is fine.
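Filling in the elided loop body, this two-cut variant might (for example) be combined with the write_to_file helper from earlier and the same batch lookups as in the question; the following is only a sketch of one way to put the pieces together:

exec {x_columns_fd}< <(cut -d$'\t' -f1 <input.txt)
exec {y_columns_fd}< <(cut -d$'\t' -f2 <input.txt)
while IFS='' read -r line && \
      IFS='' read -r -u "$x_columns_fd" x && \
      IFS='' read -r -u "$y_columns_fd" y; do
    a=${batches["$x"]}; b=${batches["$y"]}       # same lookup as in the question
    if [[ $a && ! $b ]]; then
        write_to_file "Output/batch_$a.txt" "$line"
    elif [[ ! $a && $b ]]; then
        write_to_file "Output/batch_$b.txt" "$line"
    elif [[ ! $a && ! $b ]]; then
        write_to_file "Output/batch_0.txt" "$line"
    elif (( a > b )); then
        write_to_file "Output/batch_$a.txt" "$line"
    else
        write_to_file "Output/batch_$b.txt" "$line"
    fi
done <input.txt
exec {x_columns_fd}<&- {y_columns_fd}<&-         # close the two helper FDs when done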

Upvotes: 3
