Michael Heneghan

Reputation: 307

Attempting to pass an escape char to awk as a variable

I am using this command:

awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'

which warns me:

awk: warning: escape sequence `\(' treated as plain `('

and prints:

Regex1 = new[[:blank:]]+File(

I would like the value to be:

new[[:blank:]]+File\(

I've tried amending the command to account for escape chars, but I always get the same warning.

Upvotes: 2

Views: 1206

Answers (3)

RARE Kpop Manifesto

Reputation: 2855

I think gawk and mawk 1/2 are also okay with the hideous but foolproof octal method, like:

  -v regex1="new[[:blank:]]+File[\\050]"  # note the double quotes

Once the first layer of backslash escaping is stripped, the regex being tested is equivalent to

/new[[:blank:]]+File[\050]/

which is as safe as it gets. The reason it matters is that something like

/new[[:blank:]]+File[\(]/

is something mawk/mawk2 are totally cool with, but gawk will give an annoying warning message. Octals (or [\x28]) get rid of that cross-awk weirdness and allow the same regex string to be deployed across all 3.

(I haven't tested against less popular variants like the original BWK awk, nawk, etc.)
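A quick way to check the octal form end to end is sketched below. The two sample input lines are made up for illustration. Inside double quotes the shell reduces `\\050` to `\050`, and awk's `-v` processing then turns that octal escape into a plain `(` inside the bracket expression.

```shell
# Sample input is hypothetical: only the first line contains "File(".
out=$(printf '%s\n' 'new  File(' 'new File' |
  awk -v regex1="new[[:blank:]]+File[\\050]" '$0 ~ regex1')

# Only the line with a literal "(" after "File" should survive the filter.
printf '%s\n' "$out"
```

This should print only `new  File(` and should not emit an escape-sequence warning, since `\050` is a recognized escape.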

PS: since I'm on the subject of octal caveats, mawk/mawk2 and gawk in binary mode are cool with octals inside square brackets for all bytes, meaning

"[\\302-\\364][\\200-\\277]+"  # this happens to be a *very* rough proxy for UTF-8

is valid for all 3. If you really want to be the hex guy, that same regex becomes

"[\\xC2-\\xF4][\\x80-\\xBF]+" 

However, gawk in Unicode mode will scream about locale whenever you attempt to put square brackets around any non-ASCII byte. To circumvent that, you'll have to list them out with a bunch of alternations, like:

(\302|\303|\304.....|\364)(\200|\201......|\277)+

This way you can get gawk in Unicode mode to handle any arbitrary byte and also handle binary input data (whatever circumstances caused that to happen), and perform full base64 or URI plus encoding/decoding from within awk (plus anything else you want, like SHA256 or LZMA). So far I've even managed to get gawk in Unicode mode to base64-encode an MP4 file without gawk spitting out the "illegal multi byte" error message.

...and also get gawk and mawk in binary mode to become mostly UTF-8 aware and safe.

The "mostly" caveat is that I haven't implemented the finer details, like doing normalization form conversions directly within awk instead of dumping out to python3 and getting the results back via getline, or keeping modifier marks with their intended character when I do a Unicode-safe substring or string reversal.

Upvotes: 1

Ed Morton

Reputation: 203985

When you run:

$ awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
awk: warning: escape sequence `\(' treated as plain `('
Regex1 = new[[:blank:]]+File(

you're assigning a string to an awk variable from the shell. When you use -v, you're asking awk to interpret escape sequences in that assignment, so that \t becomes a literal tab char, \n a newline, and so on. The ( in your string has no special meaning when escaped, so \( is treated exactly the same as (, hence the warning message.

If you want a literal \ character, you need to escape it so that \\ gets interpreted as just \:

$ awk -v regex1='new[[:blank:]]+File\\(' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(

You seem to be trying to pass a regexp to awk. In my opinion, once you need 2 escapes, your code is clearer and simpler if you put the target character into a bracket expression instead:

$ awk -v regex1='new[[:blank:]]+File[(]' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File[(]
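As a rough check that the two spellings behave the same when the variable is actually used as a dynamic regexp, here is a minimal sketch. The sample input and the variable names `input`, `escaped`, and `bracket` are made up for illustration.

```shell
# Hypothetical sample input: only the first line contains "File(".
input='f = new File("a.txt")
g = new FileReader(x)'

# Doubled backslash: -v reduces \\( to \(, which the regexp engine reads as a literal "(".
escaped=$(printf '%s\n' "$input" | awk -v regex1='new[[:blank:]]+File\\(' '$0 ~ regex1')

# Bracket expression: no backslashes involved at any layer.
bracket=$(printf '%s\n' "$input" | awk -v regex1='new[[:blank:]]+File[(]' '$0 ~ regex1')

printf 'escaped: %s\nbracket: %s\n' "$escaped" "$bracket"
```

Both variables should end up holding the same single line, `f = new File("a.txt")`.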

If you want to assign an awk variable the value of a literal string with no interpretation of escape sequences, there are other ways of doing so without using -v; see "How do I use shell variables in an awk script?".
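One such way, sketched here as an illustration rather than as the only option, is to pass the string through the environment and read it from the POSIX `ENVIRON` array, which is not subject to the escape processing that `-v` applies. The environment variable name `regex1` is reused from the question.

```shell
# Set regex1 in awk's environment for this one command; ENVIRON returns it byte for byte.
out=$(regex1='new[[:blank:]]+File\(' awk 'BEGIN { print "Regex1 =", ENVIRON["regex1"] }')

printf '%s\n' "$out"
```

This should print `Regex1 = new[[:blank:]]+File\(` with the single backslash intact and no warning.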

Upvotes: 2

anubhava

Reputation: 785481

If you use GNU awk, you can use a regexp literal in the @/.../ format and avoid double escaping:

awk -v regex1='@/new[[:blank:]]+File\(/' 'BEGIN{print "Regex1 =", regex1}'

Regex1 = new[[:blank:]]+File\(

Upvotes: 2
