Reputation: 223
I have a problem when processing text files in a data processing pipeline in Shell and Python.
What is a better way to print text files to stdout
so that they can be fed through a data processing pipeline (which uses perl
in the script tokenize.sh
and python
)?
My current Shell script works fine except that it does not output the last line of a txt
file. I'm also not sure whether I should use cat
or echo
or something else (instead of while IFS= read line ...
) for better performance.
for f in path/to/dir/*.txt; do
    while IFS= read line
    do
        echo $line
    done < "$f" \
        | tokenize.sh \
        | python clean.py \
        >> $f.clean.txt
    rm $f
    mv $f.clean.txt $f
done
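For what it's worth, the missing last line usually means the file has no trailing newline: read then returns non-zero at EOF even though it has already filled the variable, so the loop body is skipped for that line. A minimal sketch of the common workaround (keeping the rest of the pipeline unchanged):
while IFS= read -r line || [ -n "$line" ]; do
    # print the line even when read hit EOF without a trailing newline
    printf '%s\n' "$line"
done < "$f"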
I tried using awk
as shown below, and it seems to work well.
for f in path/to/dir/*.txt; do
    awk '{ print }' $f \
        | tokenize.sh \
        | python clean.py \
        >> $f.clean.txt
    rm $f
    mv $f.clean.txt $f
done
Upvotes: 0
Views: 172
Reputation: 3562
Try this:
for f in path/to/dir/*.txt; do
    # - while loop replaced by "<"
    # - "$f" quoted to handle special chars. <<< IMPORTANT!
    # - is ">>" really necessary?
    #   it has a side effect if "$f.clean.txt" already exists: output is appended
    tokenize.sh < "$f" | python clean.py > "$f.clean.txt"
    # "mv" already covers "rm", and the file "$f" always exists afterwards
    # rm $f
    mv "$f.clean.txt" "$f"
done
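If you want to be extra careful, here is a sketch of the same idea (assuming tokenize.sh is executable and on the PATH) that only overwrites the original file when the pipeline succeeds:
for f in path/to/dir/*.txt; do
    # ">" truncates any old "$f.clean.txt"; the "if" guards the mv
    if tokenize.sh < "$f" | python clean.py > "$f.clean.txt"; then
        mv "$f.clean.txt" "$f"
    else
        rm -f "$f.clean.txt"   # drop the partial output on failure
    fi
done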
Upvotes: 2