Guido

Reputation: 53

Bash looping through file ends prematurely

I am having trouble in Bash looping through a text file of ~20k lines.

Here is my (minimised) code:

LINE_NB=0
while IFS= read -r LINE; do
    LINE_NB=$((LINE_NB+1))
    CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< ${LINE})
    echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"   
done <"${FILE}"

The while loop ends prematurely after a few hundred iterations. However, the loop works correctly if I remove the CMD=$(sed...) part, so evidently there is some interference I cannot spot.

As I read here, I also tried:

LINE_NB=0
while IFS= read -r -u4 LINE; do
    LINE_NB=$((LINE_NB+1))
    CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< ${LINE})
    echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"
done 4<"${FILE}"

but nothing changes. Any explanation for this behaviour, and advice on how I can solve it?

Thanks!

To clarify the situation for user1934428 (thanks for your interest!), I have now created a minimal script and added "set -x". The full script is as follows:

#!/usr/bin/env bash
set -x
FILE="$1"
LINE_NB=0

while IFS= read -u "$file_fd" -r LINE; do
  LINE_NB=$((LINE_NB+1))
  CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< "${LINE}")
  echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'" #, TIME='${TIME}' "

done {file_fd}<"${FILE}"

echo "Done."

The input file is a list of ~20k lines of the form:

S1 0.018206
L1 0.018966
F1 0.006833
S2 0.004212
L2 0.008005
I8R190 18.3791
I4R349 18.5935
...

The while loop ends prematurely at (seemingly) random points. One possible output is:

+ FILE=20k/ir-collapsed.txt
+ LINE_NB=0
+ IFS=
+ read -u 10 -r LINE
+ LINE_NB=1
++ sed 's/\([^ ]*\) .*/\1/'
+ CMD=S1
+ echo '[1] S1 0.018206: CMD='\''S1'\'''
[1] S1 0.018206: CMD='S1'
...[snip]...
+ echo '[6510] S1514 0.185504: CMD='\''S1514'\'''
[6510] S1514 0.185504: CMD='S1514'
+ IFS=
+ read -u 10 -r LINE
+ echo Done.
Done.

As you can see, the loop ends prematurely after line 6510, while the input file is ~20k lines long.

Upvotes: 3

Views: 148

Answers (1)

Paul Hodges

Reputation: 15246

Yes, making a stable file copy is a good first step; there's a quick sketch of that just below.
Learning awk and/or perl is still well worth your time. It's not as hard as it looks. :)
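A minimal sketch of that snapshot idea, assuming the original file might still be written to while your loop reads it (the ".snapshot" suffix is just an illustration):

cp -- "${FILE}" "${FILE}.snapshot"   # work from a copy that nothing else is modifying
FILE="${FILE}.snapshot"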

Aside from that, a couple of optimizations: try never to run an external program inside a loop when you can avoid it. For a 20k-line file, that's 20k calls to sed, which adds up unnecessarily. Instead you could just use parameter expansion for this one.

# don't use all caps for ordinary shell variables.
# cmd=$(sed "s/\([^ ]*\) .*/\1/" <<< "${line}") becomes
cmd="${line%% *}" # strip everything from the first space onward

Using the read itself to handle that is even better, since you were already using it anyway, but don't call it a second time if you can avoid it. As much as I love it, read is pretty inefficient; it has to do a lot of fiddling to handle all its options.

while IFS= read -u "$file_fd" cmd timeval; do
  echo "[$((++line_nb))] CMD='${CMD}' TIME='${timeval}'"
done {file_fd}<"${file}"

or

while IFS= read -u "$file_fd" -r -a tok; do
  echo "[$((++line_nb))] LINE='${tok[@]}' CMD='${tok[0]}' TIME='${tok[1]}'"
done {file_fd}<"${file}"

(This sort of rebuilds the line, but if there were tabs or extra spaces, etc., it will only pad with the 1st char of $IFS, which is a space by default. That shouldn't matter here.)
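A quick way to see that padding behaviour, using a made-up line with tabs in it (not from the real input file):

read -r -a tok <<< $'S1\t\t0.018206'
echo "LINE='${tok[*]}'"   # prints LINE='S1 0.018206' - the tabs collapse to single spaces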

awk would have made short work of this, though, and been a lot faster, with better tools already built in.

awk '{printf "NR=[%d] LINE=[%s] CMD=[%s] TIME=[%s]\n",NR,$0,$1,$2 }' 20k/ir-collapsed.txt

Run some time comparisons - with and without the sed, with one read vs two, and then compare each against the awk. :)
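A rough sketch of such a comparison, assuming you save each variant as its own script (the script names here are made up) and use bash's time keyword:

time ./loop-with-sed.sh 20k/ir-collapsed.txt > /dev/null
time ./loop-one-read.sh 20k/ir-collapsed.txt > /dev/null
time awk '{printf "NR=[%d] CMD=[%s] TIME=[%s]\n",NR,$1,$2}' 20k/ir-collapsed.txt > /dev/null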

The more things you have to do with each line, and the more lines there are in the file, the more it will matter. Make it a habit to do even small things as neatly as you can - it will pay off well in the long run.

Upvotes: 2
