Reputation: 17208
Let's suppose that you got octal escape sequences in a stream:
backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...
note: In my input the escaped characters are limited to 0x01-0x1F 0x22 0x5C 0x7F
How can you revert those escape sequences back to their corresponding character with awk
?
While awk
is able to understand them out-of-box when used in a literal string or as parameter argument, I can't find the way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub
per escape sequence but it doesn't feel efficient.
Here's the expected output for the given sample:
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
Upvotes: 1
Views: 1004
Reputation: 2865
you can decode any 3-digit octal value by just manually interpreting the first 2 digits :
jot 512 0 | mawk '$++NF = sprintf("%.3o",$1)' | shuf |
ghead -n 10 | gsort -n |
awk 'function ___(__, _) {
return int((__ ".") - (substr(__, ++_ + _, _) \
+ substr(__, _, _) * (++_ + _^_^_)) * _)
} BEGIN { CONVFMT = "%.250g" } ($++NF = ___($2))^_'
35 043 35
125 175 125
145 221 145
218 332 218
235 353 235
271 417 271
275 423 275
289 441 289
332 514 332
408 630 408
Upvotes: 0
Reputation: 2865
this separate post is made specifically to showcase how to extend the octal lookup reference tables in gawk
unicode
-mode to all 256 bytes
without external dependencies or warning messages:
ASCII
bytes reside in table o2bL
o2bH
.
# gawk profile, created Fri Sep 16 09:53:26 2022
'BEGIN {
1 makeOctalRefTables(PROCINFO["sorted_in"] = "@val_str_asc" \
(ORS = ""))
128 for (_ in o2bL) {
128 print o2bL[_]
}
128 for (_ in o2bH) {
128 print o2bH[_]
}
}
function makeOctalRefTables(_,__,___,____)
{
1 _=__=___=____=""
for (_ in o2bL) {
break
}
1 if (!(_ in o2bL)) {
1 ____=_+=((_+=_^=_<_)-+-++_)^_--
128 do { o2bL[sprintf("\\%o",_)] = \
sprintf("""%c",_)
} while (_--)
1 o2bL["\\" ((_+=(_+=_^=_<_)+_)*_--+_+_)] = "\\&"
1 ___=--_*_^_--*--_*++_^_*(_^=++_)^(! —_)
128 do { o2bH[sprintf("\\%o", +_)] = \
sprintf("%c",___+_)
} while (____<--_)
}
1 return length(o2bL) ":" length(o2bH)
}'
|
\0 \1 \2 \3 \4 \5 \6 \7 \10\11 \12
\13
\14
\16 \17
\20 \21 \22 \23 \24 \25 \26 \27 \30 \31 \32 \33 34 \35 \36 \37
\40 \41 !\42 "\43 #\44 $\45 %\47 '\50 (\51 )\52 *\53 +\54 ,\55 -\56 .\57 /
\60 0\61 1\62 2\63 3\64 4\65 5\66 6\67 7\70 8\71 9\72 :\73 ;\74 <\75 =\76 >\77 ?
\100 @\101 A\102 B\103 C\104 D\105 E\106 F\107 G\110 H\111 I\112 J\113 K\114 L\115 M\116 N\117 O
\120 P\121 Q\122 R\123 S\124 T\125 U\126 V\127 W\130 X\131 Y\132 Z\133 [\134 \\46 \&\135 ]\136 ^\137 _
\140 `\141 a\142 b\143 c\144 d\145 e\146 f\147 g\150 h\151 i\152 j\153 k\154 l\155 m\156 n\157 o
\160 p\161 q\162 r\163 s\164 t\165 u\166 v\167 w\170 x\171 y\172 z\173 {\174 |\175 }\176 ~\177
\200 ?\201 ?\202 ?\203 ?\204 ?\205 ?\206 ?\207 ?\210 ?\211 ?\212 ?\213 ?\214 ?\215 ?\216 ?\217 ?
\220 ?\221 ?\222 ?\223 ?\224 ?\225 ?\226 ?\227 ?\230 ?\231 ?\232 ?\233 ?\234 ?\235 ?\236 ?\237 ?
\240 ?\241 ?\242 ?\243 ?\244 ?\245 ?\246 ?\247 ?\250 ?\251 ?\252 ?\253 ?\254 ?\255 ?\256 ?\257 ?
\260 ?\261 ?\262 ?\263 ?\264 ?\265 ?\266 ?\267 ?\270 ?\271 ?\272 ?\273 ?\274 ?\275 ?\276 ?\277 ?
\300 ?\301 ?\302 ?\303 ?\304 ?\305 ?\306 ?\307 ?\310 ?\311 ?\312 ?\313 ?\314 ?\315 ?\316 ?\317 ?
\320 ?\321 ?\322 ?\323 ?\324 ?\325 ?\326 ?\327 ?\330 ?\331 ?\332 ?\333 ?\334 ?\335 ?\336 ?\337 ?
\340 ?\341 ?\342 ?\343 ?\344 ?\345 ?\346 ?\347 ?\350 ?\351 ?\352 ?\353 ?\354 ?\355 ?\356 ?\357 ?
\360 ?\361 ?\362 ?\363 ?\364 ?\365 ?\366 ?\367 ?\370 ?\371 ?\372 ?\373 ?\374 ?\375 ?\376 ?\377 ?
Upvotes: 0
Reputation: 17208
I got my own POSIX awk
solution, so I post it here for reference.
The main idea is to build a hash that translates an octal escape sequence to its corresponding character. You can then use it while splitting the line during the search for escape sequences:
LANG=C awk '
BEGIN {
for ( i = 1; i <= 255; i++ )
tr[ sprintf("\\%03o",i) ] = sprintf("%c",i)
}
{
remainder = $0
while ( match(remainder, /\\[0-7]{3}/) ) {
printf("%s%s", \
substr(remainder, 1, RSTART-1), \
tr[ substr(remainder, RSTART, RLENGTH) ] \
)
remainder = substr(remainder, RSTART + RLENGTH)
}
print remainder
}
' input.txt
backslash `\`
single quote `'` and double quote `"`
linefeed `
` and carriage return `
%s &
etc...
Upvotes: 1
Reputation: 2865
UPDATE :: about gawk
's strtonum()
in unicode mode :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '
$++NF = "<( "(sprintf("%c", strtonum((_=_<_) substr($++_, ++_))))" )>"'
0000000 909522524 539507744 690009798 2622
\ 6 6 6 < ( ƶ ** ) > \n
134 066 066 066 040 074 050 040 306 266 040 051 076 012
\ 6 6 6 sp < ( sp ? ? sp ) > nl
92 54 54 54 32 60 40 32 198 182 32 41 62 10
5c 36 36 36 20 3c 28 20 c6 b6 20 29 3e 0a
0000016
By default, gawk
in unicode mode would decode out a multi-byte character instead of byte \266 | 0xB6
. If you wanna ensure consistency of always decoding out a single-byte out, even in gawk
unicode mode, this should do the trick :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '$++NF = sprintf("<( %c )>",
strtonum((_=_<_) substr($++_, ++_)) + _*++_^_++*_^++_)'
0000000 909522524 539507744 1042882742 10
\ 6 6 6 < ( 266 ) > \n
134 066 066 066 040 074 050 040 266 040 051 076 012
\ 6 6 6 sp < ( sp ? sp ) > nl
92 54 54 54 32 60 40 32 182 32 41 62 10
5c 36 36 36 20 3c 28 20 b6 20 29 3e 0a
0000015
long story short : add 4^5 * 54
to output of strtonum()
, which happens to be 0xD800
, the starting point of UTF-16 surrogates
=================== =================== ===================
one quick note about @Gene
's proposed perl
-based solution :
echo 'abc \555 456' | perl -p -e 's/\\([0-7]{3})/chr(oct($1))/ge'
Wide character in print at -e line 1, <> line 1.
abc ŭ 456
octal codes wrap around, meaning \4xx = \0xx ; \6xx = \2xx
etc :
printf '\n %s\n' $'\555'
m
so perl
is incorrectly decoding these as multi-byte characters, when in fact \555
, as confirmed by printf
, is merely lowercase "m" (0x6D)
ps : my perl
is version 5.34
Upvotes: 1
Reputation: 22032
With GNU awk
which supports strtonum()
function, would you
please try:
awk '{
while (match($0, /\\[0-7]{1,3}/)) {
printf("%s", substr($0, 1, RSTART - 1)) # print the substring before the match
printf("%c", strtonum("0" substr($0, RSTART + 1, RLENGTH))) # convert the octal string to character
$0 = substr($0, RSTART + RLENGTH) # update $0 with remaining substring
}
print
}' input_file
while
loop one by one.substr($0, RSTART + 1, RLENGTH)
skips the leading backslash."0"
prepended to substr
makes an octal string.strtonum()
converts the octal string to the numeric value.print
outputs the remaining substring.Upvotes: 1
Reputation: 203995
Using GNU awk for strtonum()
and lots of meaningfully-named variables to show what each step does:
$ cat tst.awk
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = strtonum(0 oct)
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
If you don't have GNU awk then write a small function to convert octal to decimal, e.g. oct2dec()
below, and then call that instead of strtonum()
:
$ cat tst2.awk
function oct2dec(oct, dec) {
dec = substr(oct,1,1) * 8 * 8
dec += substr(oct,2,1) * 8
dec += substr(oct,3,1)
return dec
}
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = oct2dec(oct) # replaced "strtonum(0 oct)"
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst2.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
The above assumes that, as discussed in comments, the only backslashes in the input will be in the context of the start of octal numbers as shown in the provided sample input.
Upvotes: 2