Calling shell from awk is incredibly slow

Question

I want to process a value in awk on the fly. The value is processed by an external binary. I'm trying to do it the following way, but it is extremely slow, to the point of being unusable. Processing 5 million records without this step finishes in 30 seconds. With it, I waited several hours and it still had not finished.

Am I doing something wrong? Is there a correct way to process a value in awk using an external app?

bash call

#!/bin/bash
...    
cat "${INFILE}" | awk -F'\t' -v sh_dir="${DIRECTORY_PATH_SH}" -v outfile="${OUTFILE}" -f process.awk

process.awk

{   
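    cmd = sh_dir "/sha_cipher"
    print $2 |& cmd                 # send the 2nd field to the co-process
    close(cmd, "to")                # close the write end so the command sees EOF
    cmd |& getline encrypted_id     # read the result back
    close(cmd)                      # shut the co-process down (one spawn per record)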

    printf "%s\t%s\t%s\n", $1, encrypted_id, $19 >> outfile
}

INPUT:

2018-09-14 | AlexOrange | 15 | HTTP | 86914702 | 1 | 1 | NO | 79634 | 48249 | 127883 | LEFT | MODEL1 | SUBTYPE255 A536 | RS | SO | 94 | Elixir | RTT

OUTPUT:

2018-09-14 | 36c8387b7e334c38786d6d497b | RTT

Ed Morton · Accepted Answer

I don't have sha_cipher on my PC but let's imagine your shell command was tr 'a-z' 'A-Z' instead of sha_cipher. Look (tab-separated input):

$ cat file
2018-09-14      AlexOrange      15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94     Elixir   RTT
2018-09-14      Joe Bloggs      15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94     Elixir   RTT
2018-09-14      Sue Everyone    15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94     Elixir   RTT

$ cut -f2 file | tr 'a-z' 'A-Z'
ALEXORANGE
JOE BLOGGS
SUE EVERYONE

$ cut -f2 file | tr 'a-z' 'A-Z' |
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[NR]=$0;next} {print $1, a[FNR], $19}' - file
2018-09-14      ALEXORANGE      RTT
2018-09-14      JOE BLOGGS      RTT
2018-09-14      SUE EVERYONE    RTT

That will be orders of magnitude more efficient than having awk spawn a subshell to call your command once for every line of input, assuming sha_cipher can operate on multiple values in piped input, as tr and most other text-processing commands (cut, sed, grep, sort, uniq, etc.) can.
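
A quick way to check that assumption before relying on the pipeline, sketched here with tr standing in for sha_cipher: confirm the command emits exactly one output line per input line, since the awk step pairs the two streams purely by line position. check_one_to_one is a hypothetical helper name used for illustration only.

```shell
#!/bin/bash
# check_one_to_one FILE CMD [ARGS...]
# Succeeds only if CMD, fed field 2 of the tab-separated FILE on stdin,
# prints exactly as many lines as FILE contains.
# (Hypothetical helper; the name is an assumption, not part of any tool.)
check_one_to_one() {
    local file=$1
    shift
    local in_lines out_lines
    in_lines=$(wc -l < "$file")
    out_lines=$(cut -f2 "$file" | "$@" | wc -l)
    [ "$in_lines" -eq "$out_lines" ]
}

# Example usage with the stand-in command:
#   check_one_to_one file tr 'a-z' 'A-Z' && echo "safe to join by line number"
```

If the check fails for the real command (for example because it reads only the first line, or prints extra header lines), the line-by-line join would silently pair values with the wrong records.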

To test timing I created a file with 5 million lines in the same format as your provided sample input line and containing random strings in the 2nd field by using:

$ cat file
2018-09-14      AlexOrange      15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94     Elixir   RTT

$ tr -dc '[:alnum:]'  file5m

$ wc -l file5m
5000000 file5m

$ head -3 file5m
2018-09-14      fLSynM  15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94      Elixir RTT
2018-09-14      mxWzLF  15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94      Elixir RTT
2018-09-14      EKJYF8  15      HTTP    86914702        1       1       NO      79634   48249   127883  LEFT    MODEL1  SUBTYPE255 A536 RS      SO      94      Elixir RTT

and here's the result of running the proposed solution on it:

$ time cut -f2 file5m | tr 'a-z' 'A-Z' | awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[NR]=$0;next} {print $1, a[FNR], $19}' - file5m > outFile5m
real    0m40.892s
user    0m42.196s
sys     0m0.980s

$ wc -l outFile5m
5000000 outFile5m

$ head -3 outFile5m
2018-09-14      FLSYNM  RTT
2018-09-14      MXWZLF  RTT
2018-09-14      EKJYF8  RTT

So unless sha_cipher is far less efficient than tr 'a-z' 'A-Z' (in which case you're just out of luck), I expect the above to run fast enough for you, i.e. in under a minute rather than several hours.
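
For anyone wanting to reproduce the timing, a comparable test file can be built along these lines. This is a sketch under assumptions, not necessarily the exact command used above: make_test_file is a hypothetical helper name, the random ids are 6 alphanumeric characters drawn from /dev/urandom, and the template is the single-line sample file.

```shell
#!/bin/bash
# make_test_file TEMPLATE COUNT
# Prints COUNT copies of the first line of the tab-separated TEMPLATE,
# each with field 2 replaced by a random 6-character alphanumeric string.
# (Hypothetical helper; name and id length are assumptions.)
make_test_file() {
    local template=$1 count=$2
    LC_ALL=C tr -dc '[:alnum:]' < /dev/urandom | fold -w 6 | head -n "$count" |
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { if (FNR == 1) tmpl = $0; next }   # remember the template line
         { id = $0; $0 = tmpl; $2 = id; print }        # put the random id in field 2
    ' "$template" -
}

# Example usage:
#   make_test_file file 5000000 > file5m
```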
