thoughtsnippet

Reputation: 93

Speed up grep and awk with gnu-parallel

I want to speed up two lines of grep and awk code with GNU Parallel, but with the simple syntax I tried, the command either breaks or never finishes. Any help is greatly appreciated!

Normal code:

for FILENAME in `cat FileList.tmp`
do
  echo "Bearbeite $FILENAME ..."
  FILE_BASENAME=`echo ${FILENAME##*/}`
  grep -v "^t=[0-9]*.[0-9]*\&\-$" ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
  awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
      ${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
  rm -f ${INPUT}/cleaned/${FILE_BASENAME}.tmp
done

Parallel try:

[...]
parallel -j100 --pipe grep -v "^t=[0-9]*.[0-9]*\&\-$" | awk '{s = s + $1} END {print s, s/NR}' ${FILENAME} > ${INPUT}/cleaned/${FILE_BASENAME}.tmp
awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
      ${INPUT}/cleaned/${FILE_BASENAME}.tmp > ${INPUT}/cleaned/${FILE_BASENAME}
[...]

I suspect that I just piped the parallel commands the wrong way.

Upvotes: 2

Views: 1033

Answers (3)

Ole Tange

Reputation: 33685

When you have a script that does the job for a single file, it is usually simple to convert it to GNU Parallel: wrap the per-file work in a function, export it, and let parallel call it once per file name:

bearbeite() {
  FILENAME=$1
  echo "Bearbeite $FILENAME ..."
  FILE_BASENAME=${FILENAME##*/}
  grep -v "^t=[0-9]*.[0-9]*\&\-$" "${FILENAME}" > "${INPUT}/cleaned/${FILE_BASENAME}.tmp"
  awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
    "${INPUT}/cleaned/${FILE_BASENAME}.tmp" > "${INPUT}/cleaned/${FILE_BASENAME}"
  rm -f "${INPUT}/cleaned/${FILE_BASENAME}.tmp"
}
# The shells started by parallel only see exported functions and variables
export -f bearbeite
export INPUT
parallel bearbeite :::: FileList.tmp
# or:
cat FileList.tmp | parallel bearbeite

To avoid the temporary file, pipe grep straight into awk. This ought to work:

bearbeite() {
  FILENAME=$1
  echo "Bearbeite $FILENAME ..."
  FILE_BASENAME=${FILENAME##*/}
  grep -v "^t=[0-9]*.[0-9]*\&\-$" "${FILENAME}" |
  awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' > "${INPUT}/cleaned/${FILE_BASENAME}"
}
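
A minimal end-to-end sketch of the pipe-based function, assuming bash (export -f is a bash feature) and GNU Parallel. The scratch directory, the two sample files, and their contents are made up for illustration. The regex escapes the dot on the assumption that a literal decimal point is intended, and the fallback loop exists only so the sketch still runs where GNU Parallel is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical scratch layout, created only for this demo
INPUT=$(mktemp -d)
export INPUT   # child shells started by parallel only see exported variables
mkdir -p "$INPUT/raw" "$INPUT/cleaned"
printf '%s\n' 't=1.0&-' 't=2.0&r=x' > "$INPUT/raw/a.log"
printf '%s\n' 'ua=foo&c=bar'        > "$INPUT/raw/b.log"
printf '%s\n' "$INPUT/raw/a.log" "$INPUT/raw/b.log" > "$INPUT/FileList.tmp"

bearbeite() {
  local filename=$1
  local base=${filename##*/}
  # Drop the "t=<number>&-" lines, then strip the key prefixes, with no temp file
  grep -v '^t=[0-9]*\.[0-9]*&-$' "$filename" |
    awk '{ gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=",""); print }' \
    > "$INPUT/cleaned/$base"
}
export -f bearbeite

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # One job per line of the file list
  parallel bearbeite :::: "$INPUT/FileList.tmp"
else
  # Sequential fallback when GNU Parallel is not available
  while IFS= read -r f; do bearbeite "$f"; done < "$INPUT/FileList.tmp"
fi
```

Each job writes to its own output file, so the jobs do not need any coordination with one another.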

Upvotes: 0

fedorqui

Reputation: 289665

Some thoughts:

while IFS= read -r FILENAME
do
   echo "Bearbeite $FILENAME ..."
   FILE_BASENAME=${FILENAME##*/} # no need to echo
   grep -v "^t=[0-9]*.[0-9]*\&\-$" "${FILENAME}" > "${INPUT}/cleaned/${FILE_BASENAME}.tmp"
   awk '{ if (gsub("t=|r=|i=|d=|ip=|ua=|uc=|um=|ud=|pc=|la=|lo=|do=|dm=|c=","")) print; else print}' \
    "${INPUT}/cleaned/${FILE_BASENAME}.tmp" > "${INPUT}/cleaned/${FILE_BASENAME}"
   rm -f "${INPUT}/cleaned/${FILE_BASENAME}.tmp"
done < FileList.tmp
  • Use while read ... done < file instead of looping over the output of cat; the for loop splits file names on whitespace.
  • Do not use echo ${FILENAME##*/} to assign the variable; just write FILE_BASENAME=${FILENAME##*/}.
  • Explain what you want to accomplish with the grep/awk pair, because it can probably be improved. For example, the following expression does not make much sense:

    awk '{if (gsub("t=|...|c=","")) print; else print}' ...
    

Both branches print the line: after the replacement if one was made, or unchanged if not. You get the same result by writing gsub(); print, because gsub() modifies $0 (the line) in place whenever it matches:

awk '{gsub("t=|...|c=",""); print}' ...
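
As a quick check of that claim, the sketch below runs both forms over two made-up sample lines (one with keys, one without) and also prints the count that gsub() returns. The sample data and the shortened key list are assumptions for illustration only.

```shell
# Hypothetical sample records: one with keys to strip, one without
sample=$(printf '%s\n' 't=1.5&r=abc&ip=10.0.0.1' 'no keys here')

# Original form: both branches print the (possibly modified) line
verbose=$(printf '%s\n' "$sample" | awk '{ if (gsub("t=|r=|ip=","")) print; else print }')

# Simplified form: substitute, then print unconditionally
short=$(printf '%s\n' "$sample" | awk '{ gsub("t=|r=|ip=",""); print }')

# gsub() returns the number of substitutions made on each line
counts=$(printf '%s\n' "$sample" | awk '{ n = gsub("t=|r=|ip=",""); print n }')

printf '%s\n' "$short"
printf '%s\n' "$counts"
```

The two forms print identical output, so the if/else adds nothing.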

Upvotes: 2

Tom Fenech

Reputation: 74605

As fedorqui has already made some points on the structure of your loop, I will focus on combining the grep and awk parts:

awk '!(/^t=[0-9]*.[0-9]*\&\-$/) {
     gsub(/(t|r|i|d|ip|ua|uc|um|ud|pc|la|lo|do|dm|c)=/,""); print }' input > output

When the pattern doesn't match (same as grep -v), perform the substitution and print the result. Other lines will not be printed.

In awk, gsub modifies the target (the whole record, $0, by default) and returns the number of substitutions made. I have removed the conditional code as it seems that you want to print the record, whether any substitutions were made or not.
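
A small sketch comparing the two-step grep/awk pipeline with the single awk call on made-up sample lines. It assumes the dot in the pattern is meant as a literal decimal point and writes the regex without the backslashes before & and -, which are ordinary characters in both grep and awk patterns; the sample data is hypothetical.

```shell
# Hypothetical sample: the first line should be dropped, the others cleaned
sample=$(printf '%s\n' 't=12.5&-' 't=3.0&r=foo&c=bar' 'ua=Mozilla&lo=8.5')

# Two processes: grep filters, awk strips the key prefixes
two_step=$(printf '%s\n' "$sample" |
  grep -v '^t=[0-9]*\.[0-9]*&-$' |
  awk '{ gsub(/(t|r|i|d|ip|ua|uc|um|ud|pc|la|lo|do|dm|c)=/, ""); print }')

# One process: awk skips matching lines and cleans the rest
one_step=$(printf '%s\n' "$sample" |
  awk '!/^t=[0-9]*\.[0-9]*&-$/ {
         gsub(/(t|r|i|d|ip|ua|uc|um|ud|pc|la|lo|do|dm|c)=/, ""); print }')

printf '%s\n' "$one_step"
```

Both variables hold the same two cleaned lines; the single awk call just saves one process and one pipe per file.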

Upvotes: 1
