Sophil

Reputation: 223

Most efficient way to print a text file to stdout to put through a data processing pipeline using Shell

I have a problem when processing text files in a data processing pipeline built from shell and Python.

What is a better way to print text files to stdout so they can be fed through a data processing pipeline (a Perl-based tokenize.sh script followed by Python)?

My current shell script works fine except that it does not output the last line of a txt file. I'm also not sure whether I should use cat, echo, or something else (instead of while IFS= read line ...) for better performance.

for f in path/to/dir/*.txt; do
  while IFS= read line
  do
    echo $line 
  done < "$f" \
  | tokenize.sh \
  | python clean.py \
  >> $f.clean.txt 
  rm $f 
  mv $f.clean.txt $f 
done
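
I suspect the missing last line happens because my input files do not end with a newline; this minimal test (with made-up input, not one of my actual files) seems to reproduce it:

# prints only "line1": read returns non-zero for the final, newline-less
# line, so the loop body never runs for it
printf 'line1\nline2' | while IFS= read line; do echo "$line"; done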

I tried using awk as below and it seems to work well.

for f in path/to/dir/*.txt; do
  awk '{ print }' $f \
  | tokenize.sh \
  | python clean.py \
  >> $f.clean.txt 
  rm $f 
  mv $f.clean.txt $f 
done

Upvotes: 0

Views: 172

Answers (1)

Wiimm

Reputation: 3562

Try this:

for f in path/to/dir/*.txt; do

  # - The while/read loop is replaced by a plain "<" input redirection.
  # - "$f" is quoted to handle spaces and special chars. <<< IMPORTANT!
  # - Is ">>" really necessary? Appending has the side effect of keeping
  #   old data if "$f.clean.txt" already exists, so ">" is used instead.

  tokenize.sh < "$f" | python clean.py > "$f.clean.txt"

  # "mv" includes "rm" and && file "$f" exists always
  # rm $f
  mv "$f.clean.txt" "$f"

done
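
If tokenize.sh or clean.py can fail partway, a variant of the same loop (just a sketch, not part of the answer above; set -o pipefail is assumed so the check also covers tokenize.sh, not only the last command) keeps the original file unless the whole pipeline succeeded:

set -o pipefail   # pipeline exit status reflects failures in tokenize.sh too

for f in path/to/dir/*.txt; do
  if tokenize.sh < "$f" | python clean.py > "$f.clean.txt"; then
    mv "$f.clean.txt" "$f"     # replace the original only on success
  else
    rm -f "$f.clean.txt"       # discard partial output, keep the original
  fi
done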

Upvotes: 2
