Reputation: 93
I am looking to speed up two lines of grep and awk code with the great GNU Parallel tool, but with the simple syntax it either breaks or loops forever. Help is greatly appreciated!
Normal code:
for FILENAME in `cat FileList.tmp`
do
echo "Bearbeite $FILENAME ..."
FILE_BASENAME=`echo ${FILENAME##*/}`
grep -v "^t=[0-9]*.[0-9]*\&\-$" ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
rm -f ${INPUT}/cleaned/${FILE_BASENAME}.tmp
done
Parallel try:
[...]
parallel -j100 --pipe grep -v "^t=[0-9]*.[0-9]*\&\-$" | awk '{s = s + $1} END {print s, s/NR}' ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
[...]
My thoughts are that I just piped the parallel commands the wrong way...
Upvotes: 2
Views: 1033
Reputation: 33685
When you have a script that does the job for a single file, it is usually trivially simple to convert it to GNU Parallel:
bearbeite() {
FILENAME=$1
echo "Bearbeite $FILENAME ..."
FILE_BASENAME=${FILENAME##*/}
grep -v "^t=[0-9]*.[0-9]*\&\-$" ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
rm -f ${INPUT}/cleaned/${FILE_BASENAME}.tmp
}
export -f bearbeite
parallel bearbeite :::: FileList.tmp
# or:
cat FileList.tmp | parallel bearbeite
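One thing to watch with this conversion: GNU Parallel runs each job in a new shell, so the function has to be exported with export -f (as above) and any variable it uses, here INPUT, has to be exported as well. The following is a minimal sketch of that point, assuming bash. The temporary directory, the sample files a.log and b.log and their contents are made up for illustration, the function is cut down to the grep step, and the backslashes before & and - are dropped from the pattern because grep does not need them. A plain bash -c call stands in for the child shell that parallel would start, so the sketch runs sequentially and without parallel installed.

```shell
#!/usr/bin/env bash
# Hypothetical setup: a throwaway directory with two small input files.
INPUT=$(mktemp -d)
export INPUT    # child shells only see exported variables
mkdir -p "$INPUT/cleaned"
printf '%s\n' 't=1.0&r=a' 't=2.0&-' > "$INPUT/a.log"
printf '%s\n' 'd=x&c=y' > "$INPUT/b.log"
printf '%s\n' "$INPUT/a.log" "$INPUT/b.log" > "$INPUT/FileList.tmp"

# Cut-down version of the function: only the grep step.
bearbeite() {
  FILENAME=$1
  FILE_BASENAME=${FILENAME##*/}
  grep -v '^t=[0-9]*.[0-9]*&-$' "$FILENAME" > "$INPUT/cleaned/$FILE_BASENAME"
}
export -f bearbeite    # makes the function visible to child bash processes

# Stand-in for one parallel job: start a fresh bash and call the function.
while IFS= read -r f; do
  bash -c 'bearbeite "$1"' _ "$f"
done < "$INPUT/FileList.tmp"

ls "$INPUT/cleaned"
```

With GNU Parallel installed, the while loop would be replaced by parallel bearbeite :::: "$INPUT/FileList.tmp"; the two export lines stay the same.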
To avoid the temporary file this ought to work:
bearbeite() {
FILENAME=$1
echo "Bearbeite $FILENAME ..."
FILE_BASENAME=${FILENAME##*/}
grep -v "^t=[0-9]*.[0-9]*\&\-$" ${FILENAME} |
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' > ${INPUT}/cleaned/${FILE_BASENAME}
}
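A quick way to check the pipe version is to run it on a small throwaway file. This is only a sketch, assuming bash: the temporary directory, the file name sample.log and the two sample lines are invented for the test, the variables are quoted, and the backslashes before & and - are dropped from the grep pattern since they are not needed there.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; INPUT points at a throwaway directory.
INPUT=$(mktemp -d)
mkdir -p "$INPUT/cleaned"
printf '%s\n' 't=12.5&-' 't=1.0&r=abc&ip=10.0.0.1' > "$INPUT/sample.log"

bearbeite() {
  FILENAME=$1
  FILE_BASENAME=${FILENAME##*/}
  # grep drops the "t=<number>&-" lines, awk strips the key prefixes;
  # the pipe replaces the .tmp file.
  grep -v '^t=[0-9]*.[0-9]*&-$' "$FILENAME" |
    awk '{ gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=",""); print }' \
    > "$INPUT/cleaned/$FILE_BASENAME"
}

bearbeite "$INPUT/sample.log"
cat "$INPUT/cleaned/sample.log"
```

The first sample line is removed by grep, and the second one comes out with its t=, r= and ip= prefixes stripped.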
Upvotes: 0
Reputation: 289665
Some thoughts:
while IFS= read -r FILENAME
do
echo "Bearbeite $FILENAME ..."
FILE_BASENAME=${FILENAME##*/} # no need to echo
grep -v "^t=[0-9]*.[0-9]*\&\-$" ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
rm -f ${INPUT}/cleaned/${FILE_BASENAME}.tmp
done < FileList.tmp
Use while read ... done < file instead of looping over the output of cat.
There is no need for echo ${FILENAME##*/} to assign the variable; just do FILE_BASENAME=${FILENAME##*/}.
Explain what you want to accomplish with the grep/awk pair, because it can probably be improved. For example, the following expression does not make much sense:
awk '{if (gsub("t=|...|c=","")) print; else print}' ...
Both branches print the line: either after the replacement, or unchanged if nothing was replaced. You can get the same result by saying gsub(); print directly, because gsub() updates the value of $0 (the line) when it matches:
awk '{gsub("t=|...|c=",""); print}' ...
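To see that the two forms really behave the same, here is a small sketch assuming bash and any POSIX awk. The two sample lines and the shortened prefix list t=|r= are made up for the comparison.

```shell
#!/usr/bin/env bash
# Hypothetical sample: one line with prefixes to strip, one without.
sample=$(printf '%s\n' 't=1.0&r=abc' 'no prefixes here')

# Original form: print in both branches of the if.
long=$(printf '%s\n' "$sample" | awk '{ if (gsub("t=|r=","")) print; else print }')

# Shortened form: gsub edits $0 in place, then print unconditionally.
short=$(printf '%s\n' "$sample" | awk '{ gsub("t=|r=",""); print }')

printf '%s\n' "$short"
[ "$long" = "$short" ] && echo "identical output"
```

The return value of gsub() only matters if you want to act differently on lines that had no match, which is not the case here.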
Upvotes: 2
Reputation: 74605
As fedorqui has already made some points on the structure of your loop, I will focus on combining the grep and awk parts:
awk '!(/^t=[0-9]*.[0-9]*\&\-$/) {
gsub(/(t|r|i|d|ip|ua|uc|um|ud|pc|la|lo|do|dm|c)=/,""); print }' input > output
When the pattern doesn't match (same as grep -v), perform the substitution and print the result. Other lines will not be printed.
In awk, gsub modifies the target (the whole record, $0, by default) and returns the number of substitutions made. I have removed the conditional code as it seems that you want to print the record, whether any substitutions were made or not.
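As a sanity check, the single awk command can be compared against the grep | awk pipeline on a few test lines. This is a sketch assuming bash; the temporary file and its three lines are invented, and the backslashes before & and - are left out of both patterns because neither character is special in these regular expressions.

```shell
#!/usr/bin/env bash
# Hypothetical input: one line to drop, one to clean, one to pass through.
input=$(mktemp)
printf '%s\n' 't=12.5&-' 't=3.0&ip=10.0.0.1&ua=curl' 'plain line' > "$input"

# Two-step version: grep filters, awk strips the prefixes.
piped=$(grep -v '^t=[0-9]*.[0-9]*&-$' "$input" |
  awk '{ gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=",""); print }')

# One-step version: the negated pattern does the job of grep -v.
single=$(awk '!/^t=[0-9]*.[0-9]*&-$/ {
  gsub(/(t|r|i|d|ip|ua|uc|um|ud|pc|la|lo|do|dm|c)=/, ""); print }' "$input")

printf '%s\n' "$single"
[ "$piped" = "$single" ] && echo "same result"
```

Doing both steps in one awk process saves a process and a pipe per file, which adds up when the file list is long.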
Upvotes: 1