Fravadona

Reputation: 17208

decoding octal escape sequences with awk

Suppose you have octal escape sequences in a stream:

backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...

note: In my input the escaped characters are limited to 0x01-0x1F 0x22 0x5C 0x7F

How can you convert those escape sequences back to their corresponding characters with awk?

While awk understands them out of the box when they appear in a string literal or in a command-line argument, I can't find a way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub per escape sequence, but it doesn't feel efficient.
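To make the distinction concrete, here is a minimal sketch, assuming any POSIX awk and a POSIX shell. The escapes are decoded when they sit in the program's string literal, but pass through untouched when they arrive as input data:

```shell
# Escapes inside an awk string literal are decoded when awk parses the program.
literal=$(awk 'BEGIN { print "\134 \042" }')

# The same characters arriving as input data are left as-is.
data=$(printf '%s\n' '\134 \042' | awk '{ print }')

printf 'literal: %s\ndata:    %s\n' "$literal" "$data"
```

The first line printed shows a backslash and a double quote; the second still shows \134 \042.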

Here's the expected output for the given sample:

backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...

Upvotes: 1

Views: 1004

Answers (6)

RARE Kpop Manifesto

Reputation: 2865

You can decode any 3-digit octal value by manually interpreting just the first 2 digits:


jot 512 0 | mawk '$++NF = sprintf("%.3o",$1)' | shuf |

ghead -n 10 | gsort -n | 
awk 'function ___(__, _) {

          return int((__ ".") - (substr(__, ++_ + _, _) \
               + substr(__, _, _) * (++_ + _^_^_)) * _)

     } BEGIN { CONVFMT = "%.250g" } ($++NF = ___($2))^_'

 35 043  35
125 175 125
145 221 145
218 332 218
235 353 235
271 417 271
275 423 275
289 441 289
332 514 332
408 630 408
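For readers who find the obfuscated function hard to follow, here is a readable sketch of what appears to be the same arithmetic. It is my own reconstruction, not the answerer's code, and the helper name oct3 is hypothetical. Reading a 3-digit octal string abc as a decimal number gives 100a + 10b + c, while its octal value is 64a + 8b + c, so subtracting 36a + 2b corrects the difference using only the first two digits:

```shell
# oct3: hypothetical helper, takes exactly three octal digits as its argument.
oct3() {
    awk -v s="$1" 'BEGIN {
        a = substr(s, 1, 1)        # first digit
        b = substr(s, 2, 1)        # second digit
        print s - 36 * a - 2 * b   # decimal reading minus the correction
    }'
}

oct3 043
oct3 175
oct3 630
```

The three calls use values from the sample output above and should print 35, 125 and 408.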

Upvotes: 0

RARE Kpop Manifesto

Reputation: 2865

this separate post is made specifically to showcase how to extend the octal lookup reference tables in gawk unicode-mode to all 256 bytes without external dependencies or warning messages:

  • ASCII bytes reside in table o2bL
  • 8-bit bytes reside in table o2bH

.

   # gawk profile, created Fri Sep 16 09:53:26 2022

   'BEGIN {
 1      makeOctalRefTables(PROCINFO["sorted_in"] = "@val_str_asc" \
                                            (ORS = ""))
128     for (_ in o2bL) {
128        print o2bL[_]
        }
128     for (_ in o2bH) {
128        print o2bH[_]
        }
   }

   function makeOctalRefTables(_,__,___,____)
   {
 1    _=__=___=____=""
      for (_ in o2bL) {
         break
      }
 1    if (!(_ in o2bL)) {

 1        ____=_+=((_+=_^=_<_)-+-++_)^_--

128        do { o2bL[sprintf("\\%o",_)] = \
                     sprintf("""%c",_)

           } while (_--)

 1         o2bL["\\" ((_+=(_+=_^=_<_)+_)*_--+_+_)] = "\\&"

 1         ___=--_*_^_--*--_*++_^_*(_^=++_)^(! --_)

128        do { o2bH[sprintf("\\%o", +_)] = \
                     sprintf("%c",___+_)

           } while (____<--_)
       }
 1     return length(o2bL) ":" length(o2bH)
   }'

|

\0 \1 \2 \3 \4 \5 \6 \7 \10\11    \12 
\13 
    \14 
\16 \17 
\20 \21 \22 \23 \24 \25 \26 \27 \30 \31 \32 \33 34 \35 \36 \37 
\40  \41 !\42 "\43 #\44 $\45 %\47 '\50 (\51 )\52 *\53 +\54 ,\55 -\56 .\57 /
\60 0\61 1\62 2\63 3\64 4\65 5\66 6\67 7\70 8\71 9\72 :\73 ;\74 <\75 =\76 >\77 ?
\100 @\101 A\102 B\103 C\104 D\105 E\106 F\107 G\110 H\111 I\112 J\113 K\114 L\115 M\116 N\117 O
\120 P\121 Q\122 R\123 S\124 T\125 U\126 V\127 W\130 X\131 Y\132 Z\133 [\134 \\46 \&\135 ]\136 ^\137 _
\140 `\141 a\142 b\143 c\144 d\145 e\146 f\147 g\150 h\151 i\152 j\153 k\154 l\155 m\156 n\157 o
\160 p\161 q\162 r\163 s\164 t\165 u\166 v\167 w\170 x\171 y\172 z\173 {\174 |\175 }\176 ~\177 
\200 ?\201 ?\202 ?\203 ?\204 ?\205 ?\206 ?\207 ?\210 ?\211 ?\212 ?\213 ?\214 ?\215 ?\216 ?\217 ?
\220 ?\221 ?\222 ?\223 ?\224 ?\225 ?\226 ?\227 ?\230 ?\231 ?\232 ?\233 ?\234 ?\235 ?\236 ?\237 ?
\240 ?\241 ?\242 ?\243 ?\244 ?\245 ?\246 ?\247 ?\250 ?\251 ?\252 ?\253 ?\254 ?\255 ?\256 ?\257 ?
\260 ?\261 ?\262 ?\263 ?\264 ?\265 ?\266 ?\267 ?\270 ?\271 ?\272 ?\273 ?\274 ?\275 ?\276 ?\277 ?
\300 ?\301 ?\302 ?\303 ?\304 ?\305 ?\306 ?\307 ?\310 ?\311 ?\312 ?\313 ?\314 ?\315 ?\316 ?\317 ?
\320 ?\321 ?\322 ?\323 ?\324 ?\325 ?\326 ?\327 ?\330 ?\331 ?\332 ?\333 ?\334 ?\335 ?\336 ?\337 ?
\340 ?\341 ?\342 ?\343 ?\344 ?\345 ?\346 ?\347 ?\350 ?\351 ?\352 ?\353 ?\354 ?\355 ?\356 ?\357 ?
\360 ?\361 ?\362 ?\363 ?\364 ?\365 ?\366 ?\367 ?\370 ?\371 ?\372 ?\373 ?\374 ?\375 ?\376 ?\377 ?

 

Upvotes: 0

Fravadona

Reputation: 17208

I have my own POSIX awk solution, which I'm posting here for reference.

The main idea is to build a hash that translates an octal escape sequence to its corresponding character. You can then use it while splitting the line during the search for escape sequences:

LANG=C awk '
    BEGIN {
        for ( i = 1; i <= 255; i++ )
            tr[ sprintf("\\%03o",i) ] = sprintf("%c",i)
    }
    {
        remainder = $0
        while ( match(remainder, /\\[0-7]{3}/) ) {
            printf("%s%s", \
                   substr(remainder, 1, RSTART-1), \
                   tr[ substr(remainder, RSTART, RLENGTH) ] \
            )
            remainder = substr(remainder, RSTART + RLENGTH)
        }
        print remainder
    }
' input.txt
backslash `\`
single quote `'` and double quote `"`
linefeed `
` and carriage return `
%s &
etc...

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2865

UPDATE :: about gawk's strtonum() in unicode mode :

echo '\666' | 

  LC_ALL='en_US.UTF-8' gawk -e '

  $++NF = "<( "(sprintf("%c", strtonum((_=_<_) substr($++_, ++_))))" )>"'

0000000         909522524       539507744       690009798            2622
           \   6   6   6       <   (       ƶ  **       )   >  \n        
          134 066 066 066 040 074 050 040 306 266 040 051 076 012        
           \   6   6   6  sp   <   (  sp   ?   ?  sp   )   >  nl        
           92  54  54  54  32  60  40  32 198 182  32  41  62  10        
           5c  36  36  36  20  3c  28  20  c6  b6  20  29  3e  0a        

0000016

By default, gawk in unicode mode decodes this to a multi-byte character instead of the single byte \266 (0xB6). If you want to always decode to a single byte, even in gawk unicode mode, this should do the trick:

echo '\666' | 

     LC_ALL='en_US.UTF-8' gawk -e '$++NF = sprintf("<( %c )>",

                 strtonum((_=_<_) substr($++_, ++_)) + _*++_^_++*_^++_)'

0000000         909522524       539507744      1042882742              10
           \   6   6   6       <   (     266       )   >  \n            
          134 066 066 066 040 074 050 040 266 040 051 076 012            
           \   6   6   6  sp   <   (  sp   ?  sp   )   >  nl            
           92  54  54  54  32  60  40  32 182  32  41  62  10            
           5c  36  36  36  20  3c  28  20  b6  20  29  3e  0a            

0000015

Long story short: add 4^5 * 54 (which happens to be 0xD800, the starting point of the UTF-16 surrogates) to the output of strtonum().
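A quick sanity check of the numbers, as a sketch that needs only a POSIX awk. It verifies the arithmetic and nothing else; why gawk then emits a single byte is not something this snippet demonstrates. The value 438 is octal 666 from the example above:

```shell
# The offset itself: 4^5 * 54.
offset=$(awk 'BEGIN { printf "%d 0x%X", 4^5 * 54, 4^5 * 54 }')

# Octal 666 (decimal 438) after adding the offset.
shifted=$(awk 'BEGIN { printf "%d 0x%X", 438 + 4^5 * 54, 438 + 4^5 * 54 }')

echo "offset:  $offset"
echo "shifted: $shifted"
```

This prints 55296 0xD800 for the offset and 55734 0xD9B6 for the shifted value, whose low byte 0xB6 is the byte seen in the second dump.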

=================== =================== ===================

one quick note about @Gene's proposed perl-based solution :

echo 'abc \555 456' | perl -p -e 's/\\([0-7]{3})/chr(oct($1))/ge'

Wide character in print at -e line 1, <> line 1.
abc ŭ 456

octal codes wrap around, meaning \4xx = \0xx ; \6xx = \2xx etc :

printf '\n %s\n' $'\555'

 m

so perl is decoding these as multi-byte characters, whereas the shell, as confirmed by printf, treats \555 as merely lowercase "m" (0x6D)

ps : my perl is version 5.34
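The wrap-around can be checked by hand. This sketch assumes that wrapping means keeping only the low 8 bits (the value modulo 256), which is consistent with the printf output shown above:

```shell
# Octal 555 as a number, then reduced to a single byte.
wrapped=$(awk 'BEGIN {
    n = 5 * 64 + 5 * 8 + 5          # octal 555 = decimal 365
    b = n % 256                     # keep the low 8 bits
    printf "%d %d 0x%X %c", n, b, b, b
}')
echo "$wrapped"
```

It prints 365 109 0x6D m, matching the lowercase "m" above.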

Upvotes: 1

tshiono

Reputation: 22032

With GNU awk, which supports the strtonum() function, please try:

awk '{
    while (match($0, /\\[0-7]{1,3}/)) {
        printf("%s", substr($0, 1, RSTART - 1))                          # print the substring before the match
        printf("%c", strtonum("0" substr($0, RSTART + 1, RLENGTH - 1)))  # convert the octal digits to a character
        $0 = substr($0, RSTART + RLENGTH)                                # update $0 with the remaining substring
    }
    print
}' input_file

  • The while loop processes the matched substrings (octal representations) one by one.
  • substr($0, RSTART + 1, RLENGTH - 1) skips the leading backslash and takes only the matched digits, so a digit that follows the escape (as in \134134) is not swallowed.
  • Prepending "0" to the substring makes strtonum() treat it as an octal string.
  • strtonum() converts the octal string to the numeric value.
  • The final print outputs the remaining substring.

Upvotes: 1

Ed Morton

Reputation: 203995

Using GNU awk for strtonum() and lots of meaningfully-named variables to show what each step does:

$ cat tst.awk
function octs2chars(str,        head,tail,oct,dec,char) {
    head = ""
    tail = str
    while ( match(tail,/\\[0-7]{3}/) ) {
        oct  = substr(tail,RSTART+1,RLENGTH-1)
        dec  = strtonum(0 oct)
        char = sprintf("%c", dec)
        head = head substr(tail,1,RSTART-1) char
        tail = substr(tail,RSTART+RLENGTH)
    }
    return head tail
}
{ print octs2chars($0) }

$ awk -f tst.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...

If you don't have GNU awk then write a small function to convert octal to decimal, e.g. oct2dec() below, and then call that instead of strtonum():

$ cat tst2.awk
function oct2dec(oct,   dec) {
    dec =  substr(oct,1,1) * 8 * 8
    dec += substr(oct,2,1) * 8
    dec += substr(oct,3,1)
    return dec
}

function octs2chars(str,        head,tail,oct,dec,char) {
    head = ""
    tail = str
    while ( match(tail,/\\[0-7]{3}/) ) {
        oct  = substr(tail,RSTART+1,RLENGTH-1)
        dec  = oct2dec(oct)        # replaced "strtonum(0 oct)"
        char = sprintf("%c", dec)
        head = head substr(tail,1,RSTART-1) char
        tail = substr(tail,RSTART+RLENGTH)
    }
    return head tail
}
{ print octs2chars($0) }

$ awk -f tst2.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...

The above assumes that, as discussed in comments, the only backslashes in the input will be in the context of the start of octal numbers as shown in the provided sample input.

Upvotes: 2
