Reputation: 7595
The following bash code, which reads from a single input file line-by-line and writes to a large number (~100) of output files, performs unreasonably slowly -- around 30 seconds for 10,000 lines -- when I want it to be usable on millions or billions of lines of input.
In the following code, batches is an already-defined associative array (in other languages, a map).
How can this be improved?
while IFS='' read -r line
do
    x=`echo "$line" | cut -d" " -f1`;
    y=`echo "$line" | cut -d" " -f2`;
    # echo "find match between $x and $y";
    a="${batches["$x"]}";
    b="${batches["$y"]}";
    if [ -z $a ] && [ -n $b ]
    then echo "$line" >> Output/batch_$b.txt;
    elif [ -n $a ] && [ -z $b ]
    then echo "$line" >> Output/batch_$a.txt;
    elif [ -z $a ] && [ -z $b ]
    then echo "$line" >> Output/batch_0.txt;
    elif [ $a -gt $b ]
    then echo "$line" >> Output/batch_$a.txt;
    elif [ $a -le $b ]
    then echo "$line" >> Output/batch_$b.txt;
    fi
done < input.txt
Upvotes: 1
Views: 343
Reputation: 296049
First, do the splitting with parameter expansion instead of calling out to cut (the question's data is space-delimited, per its cut -d" ", so split on spaces):
while IFS= read -r line; do
    x=${line%%' '*}; rest=${line#*' '}    # first space-delimited column
    y=${rest%%' '*}; rest=${rest#*' '}    # second space-delimited column
    ...
done <input.txt
That way you aren't starting two external programs every single time you want to split line into x and y.
Under normal circumstances you could use read to do the string-splitting implicitly by reading the columns into different fields, but as read trims leading whitespace, that doesn't work correctly if (as here) your columns are whitespace-separated and the first can be empty; consequently, it's necessary to use parameter expansion. See BashFAQ #73 for details on how parameter expansion works, or BashFAQ #100 for a general introduction to string manipulation with bash's native facilities.
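To illustrate the difference, here's a minimal sketch with made-up data (not taken from the question):
# Hypothetical line whose first column is empty (it starts with the delimiter).
line=' second third'

IFS=' ' read -r x y _ <<<"$line"
printf 'read:      x=%q y=%q\n' "$x" "$y"   # x=second y=third -- wrong, the empty column vanished

x=${line%%' '*}; rest=${line#*' '}
y=${rest%%' '*}
printf 'expansion: x=%q y=%q\n' "$x" "$y"   # x='' y=second -- empty first column preserved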
Also, re-opening your output files every time you want to write a single line to them is silly at this kind of volume. Either use awk, which will handle this for you automatically (see the sketch at the end of this answer), or write a helper (note that the following requires a fairly new release of bash -- probably 4.2):
write_to_file() {
    local filename content new_out_fd
    filename=$1; shift

    # Join the remaining arguments with tabs to form the output line.
    printf -v content '%s\t' "$@"
    content=${content%$'\t'}

    # Cache one file descriptor per output file in a global associative array,
    # so each file is opened only once (declare -g needs bash 4.2).
    declare -g -A output_fds
    if ! [[ ${output_fds[$filename]} ]]; then
        exec {new_out_fd}>"$filename"
        output_fds[$filename]=$new_out_fd
    fi

    printf '%s\n' "$content" >&"${output_fds[$filename]}"
}
...and then:
if [[ $a && ! $b ]]; then
    write_to_file "Output/batch_$a.txt" "$line"
elif [[ ! $a && $b ]]; then
    write_to_file "Output/batch_$b.txt" "$line"
elif [[ ! $a && ! $b ]]; then
    write_to_file "Output/batch_0.txt" "$line"
elif (( a > b )); then
    write_to_file "Output/batch_$a.txt" "$line"
else
    write_to_file "Output/batch_$b.txt" "$line"
fi
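If you want to release the cached descriptors explicitly once the loop finishes (they are otherwise closed when the script exits), a hypothetical companion helper -- the name close_output_files is made up here -- could look like this:
close_output_files() {
    local filename fd
    for filename in "${!output_fds[@]}"; do
        fd=${output_fds[$filename]}
        exec {fd}>&-                 # close the descriptor stored in the cache
    done
    output_fds=()
}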
Note that caching FDs only makes sense if you have few enough output files that you can maintain open file descriptors for each of them (and such that re-opening files receiving more than one write is a net benefit). Feel free to leave this out and only do faster string-splitting if it doesn't make sense for you.
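For the awk route, here's a minimal sketch of the same routing logic. It assumes the batches map can be written out to a file beforehand -- the name batches.txt and its one-pair-per-line "key batch" format are assumptions for the sketch, not something given in the question:
# -F'[ ]' splits on every single space (plain -F' ' would collapse whitespace
# runs and drop an empty first column, just as read does).
awk -F'[ ]' '
    NR == FNR { batch[$1] = $2; next }           # first file: build the key -> batch map
    {
        a = ($1 in batch) ? batch[$1] : ""
        b = ($2 in batch) ? batch[$2] : ""
        if      (a != "" && b == "") out = a
        else if (a == "" && b != "") out = b
        else if (a == "" && b == "") out = 0
        else if (a + 0 > b + 0)      out = a
        else                         out = b
        print > ("Output/batch_" out ".txt")     # awk keeps each output file open itself
    }
' batches.txt input.txt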
Just for completeness, here's another approach (also written using automatic fd management, thus requiring bash 4.2) -- running two cut invocations and letting each of them run through the entirety of the input file.
exec {x_columns_fd}< <(cut -d" " -f1 <input.txt)
exec {y_columns_fd}< <(cut -d" " -f2 <input.txt)
while IFS='' read -r line && \
      IFS='' read -r -u "$x_columns_fd" x && \
      IFS='' read -r -u "$y_columns_fd" y; do
    ...
done <input.txt
This works because it's not cut itself that's inefficient -- it's starting it up, running it, reading its output and shutting it down all the time that costs. If you just run two copies of cut and let each of them process the whole file, performance is fine.
Upvotes: 3