Reputation: 13
I want to process a value in awk on the fly. The value is processed by an external binary. I'm trying to do it the following way, but it is extremely slow, to the point of being unusable. Processing 5 million records without this step finishes in 30 seconds; with it, I waited several hours and it never finished.
Am I doing something wrong? Is there a correct way to process a value in awk using an external program?
bash call
#!/bin/bash
...
cat "${INFILE}" | awk -F"\t" -v sh_dir="${DIRECTORY_PATH_SH}" -v outfile="${OUTFILE}" -f process.awk
process.awk
{
    cmd=sh_dir"/sha_cipher"
    print $2 |& cmd
    close(cmd, "to")
    cmd |& getline encrypted_id
    close(cmd)
    printf "%s\t%s\t%s\n", $1, encrypted_id, $19 >> outfile
}
INPUT:
2018-09-14 | AlexOrange | 15 | HTTP | 86914702 | 1 | 1 | NO | 79634 | 48249 | 127883 | LEFT | MODEL1 | SUBTYPE255 A536 | RS | SO | 94 | Elixir | RTT
OUTPUT:
2018-09-14 | 36c8387b7e334c38786d6d497b | RTT
Upvotes: 1
Views: 302
Reputation: 203557
I don't have sha_cipher on my PC, but let's imagine your shell command was tr 'a-z' 'A-Z' instead of sha_cipher. Look (tab-separated input):
$ cat file
2018-09-14 AlexOrange 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
2018-09-14 Joe Bloggs 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
2018-09-14 Sue Everyone 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
$ cut -f2 file | tr 'a-z' 'A-Z'
ALEXORANGE
JOE BLOGGS
SUE EVERYONE
$ cut -f2 file | tr 'a-z' 'A-Z' |
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[NR]=$0;next} {print $1, a[FNR], $19}' - file
2018-09-14 ALEXORANGE RTT
2018-09-14 JOE BLOGGS RTT
2018-09-14 SUE EVERYONE RTT
That will be orders of magnitude more efficient than having awk start up a subshell to call your shell command once for every line of input, assuming sha_cipher can operate on multiple values in piped input like tr and most other text-processing shell commands can (cut, sed, grep, sort, uniq, etc.).
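The join by line number only lines up if the command writes exactly one output line per input line, in the same order. A quick way to check that on a small sample might look like the sketch below. Here `cipher` is a hypothetical stand-in (just `tr`, as above) that you would replace with a call to your real binary, and the sample values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for sha_cipher; swap in the real binary to test it.
cipher() { tr 'a-z' 'A-Z'; }

# Small made-up sample of 2nd-field values, one per line.
sample=$(mktemp)
printf 'AlexOrange\nJoe Bloggs\nSue Everyone\n' > "$sample"

# Count lines going in and lines coming out of a single cipher process.
in_lines=$(wc -l < "$sample" | tr -d ' ')
out_lines=$(cipher < "$sample" | wc -l | tr -d ' ')

if [ "$in_lines" -eq "$out_lines" ]; then
    status="ok: $out_lines lines out for $in_lines lines in"
else
    status="mismatch: $out_lines lines out for $in_lines lines in"
fi
printf '%s\n' "$status"
rm -f "$sample"
```

If this reports a mismatch with the real binary (for example because it only reads its first input line), the batch approach would not work as shown and the values would have to be fed some other way.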
To test timing I created a file with 5 million lines in the same format as your provided sample input line and containing random strings in the 2nd field by using:
$ cat file
2018-09-14 AlexOrange 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
$ tr -dc '[:alnum:]' </dev/urandom | fold -w 6 | head -5000000 |
awk 'BEGIN{FS=OFS="\t"} NR==FNR{orig=$0;next} {x=$0; $0=orig; $2=x}1' file - > file5m
$ wc -l file5m
5000000 file5m
$ head -3 file5m
2018-09-14 fLSynM 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
2018-09-14 mxWzLF 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
2018-09-14 EKJYF8 15 HTTP 86914702 1 1 NO 79634 48249 127883 LEFT MODEL1 SUBTYPE255 A536 RS SO 94 Elixir RTT
and here's the result of running the proposed solution on it:
$ time cut -f2 file5m | tr 'a-z' 'A-Z' | awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[NR]=$0;next} {print $1, a[FNR], $19}' - file5m > outFile5m
real 0m40.892s
user 0m42.196s
sys 0m0.980s
$ wc -l outFile5m
5000000 outFile5m
$ head -3 outFile5m
2018-09-14 FLSYNM RTT
2018-09-14 MXWZLF RTT
2018-09-14 EKJYF8 RTT
So unless sha_cipher is far less efficient than tr 'a-z' 'A-Z' (if it is, then you're just out of luck), I expect the above should run fast enough for you (i.e. it should run in under a minute rather than taking several hours).
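If holding 5 million values in an awk array is a concern, a streaming variant of the same idea is possible with `paste`. This is only a sketch under assumptions: `cipher` is again a hypothetical stand-in for sha_cipher, the two sample rows are made up, and the command's output is assumed to contain no tabs (so the appended value is always the last field).

```shell
#!/bin/sh
# Hypothetical stand-in for sha_cipher (assumed: one output line per input line).
cipher() { tr 'a-z' 'A-Z'; }

# Two made-up tab-separated rows with 19 fields each.
infile=$(mktemp)
printf '2018-09-14\tAlexOrange\t15\tHTTP\t86914702\t1\t1\tNO\t79634\t48249\t127883\tLEFT\tMODEL1\tSUBTYPE255 A536\tRS\tSO\t94\tElixir\tRTT\n' > "$infile"
printf '2018-09-14\tJoe Bloggs\t15\tHTTP\t86914702\t1\t1\tNO\t79634\t48249\t127883\tLEFT\tMODEL1\tSUBTYPE255 A536\tRS\tSO\t94\tElixir\tRTT\n' >> "$infile"

# cut feeds field 2 to a single cipher process; paste appends each result
# to its original row as a 20th field; awk then picks fields 1, last, 19.
result=$(cut -f2 "$infile" | cipher | paste "$infile" - |
    awk 'BEGIN{FS=OFS="\t"} {print $1, $NF, $19}')

printf '%s\n' "$result"
rm -f "$infile"
```

Nothing is buffered in awk here, so memory use stays flat regardless of file size; the trade-off is that the input file is read twice, once by `cut` and once by `paste`.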
Upvotes: 2