Unescape C-like strings from GetText with perl

Question

I'm trying to interpolate double-quoted strings as defined by the PO file format.

After some testing I determined that the only recognized escape sequences are \a, \b, \f, \n, \r, \t, \v, \\, \", \xh..., \o, \oo, and \ooo (with h and o standing for hexadecimal and octal digits respectively).

Now suppose that you have the following file.po (the double-quoted string of the msgstr line is the one of interest):

msgid  "test"
msgstr "BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"

I wrote a function to interpolate those strings, leveraging eval:

sub uncstring {
    local $_ = substr((shift), 1, -1); # trim surrounding double-quotes
    s/
        \\ ( [abfnrtv\\"] | x [[:xdigit:]]+ | [0-7]{1,3} )
    /
        eval "\"\\$1\""
    /xge;
    return $_;
}

When I use it with file.po:

perl -MB -lne '
    BEGIN{ sub uncstring { ... } }
    print B::cstring(uncstring($1)) if /^msgstr (".*")/;
' file.po

I get (after escaping the output with B::cstring):

"BEL \a BS \b FF \f LF \n CR \r HT \t VT? v BSOL \\ QUOT \" HEX? \001009 Y2 y2"

In comparison, when I use the GetText utility msgexec:

msgexec -i file.po 0 | perl -MB -lp0e '$_ = B::cstring($_);'

I get (after escaping the output with B::cstring):

"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \t Y2 y2"

As you can see, the two outputs differ for the escape sequences marked by VT? and HEX?.

How can I fix my uncstring function for it to interpret the escape sequences like GetText does?

Fravadona · Accepted Answer

There were two problems with using eval to decode the PO strings:

Perl doesn't know about \v (thanks @choroba), so eval converts it to v instead of a literal VT.
GetText reads \x escape sequences of any length, but it only keeps the least significant byte (like in standard C); for example, \x01009 is equivalent to \x09 and is translated to a literal HT.
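
Both points can be checked in isolation. The snippet below is a minimal sketch (the variable names are illustrative, not taken from the code in this thread): it evaluates the two problematic escapes the way Perl's double-quote interpolation does, then applies the low-byte masking that reproduces the GetText result.

```perl
use strict;
use warnings;

# 1. Perl has no \v escape: an unrecognized escape yields the bare character.
#    The compile-time "Unrecognized escape" warning is silenced for this demo.
my $vt_perl = do { no warnings 'misc'; "\v" };   # "v", not a vertical tab
my $vt_real = "\x0B";                            # the actual VT character

# 2. Perl's unbraced \x consumes at most two hex digits,
#    so "\x01009" is chr(0x01) followed by the text "009".
my $hex_perl = "\x01009";

# GetText-like behaviour: parse all the digits, then keep the low byte.
my $hex_po = chr(0xFF & hex "01009");            # 0x1009 & 0xFF == 0x09 (HT)

# Print the code points of each string, joined by dots.
printf "%vd\n", $_ for $vt_perl, $vt_real, $hex_perl, $hex_po;
```

The four printed lines are 118, 11, 1.48.48.57 and 9: only the masked conversion turns \x01009 into a single HT character.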

Here's the final code of a function that decodes PO strings. It uses a hash to store the translations of the single-char escape sequences, and the hex and oct functions to convert the hexadecimal and octal escape sequences into integers. Each integer is then reduced to its least significant byte before being converted to a literal character.

use v5.14;  # "state" needs 5.10 and the /r modifier needs 5.14

sub po_unqqbackslash {
    # Associate the "alpha char" of a single-char escape sequence
    # to the "decoded value" of that escape sequence:
    state $decode = {
        "a"  => "\a",
        "b"  => "\b",
        "f"  => "\f",
        "n"  => "\n",
        "r"  => "\r",
        "t"  => "\t",
        "v"  => "\x0B",
        "\\" => "\\",
        "\"" => "\"",
    };
    # Run a global substitution on the string argument (trimmed
    # of its leading and trailing double-quotes).
    return substr(shift, 1, -1) =~ s{
        \\ (?:
            (?<chr> [abfnrtv\\"]  ) | # single-char escape sequence
          x (?<hex> [[:xdigit:]]+ ) | # hexadecimal escape sequence
            (?<oct> [0-7]{1,3}    )   # octal escape sequence
        )
    } [
        # Only the named group that matched is defined in %+.
        # Plain conditionals avoid the experimental "switch" feature.
        defined $+{chr} ? $decode->{$+{chr}}
      : defined $+{hex} ? chr(0xFF & hex $+{hex})
      :                   chr(0xFF & oct $+{oct})
    ]xager;
}
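
To check a decoder against the msgexec output from the question, the expected characters can be spelled out explicitly and compared. The sketch below is self-contained, so it carries its own compact decoder, po_decode (a hypothetical name; it is a numbered-capture stand-in, not the function above). Replacing the po_decode call with po_unqqbackslash would test the function above the same way.

```perl
use strict;
use warnings;

# Decoded values of the single-char escape sequences.
my %simple = (
    a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
    t => "\t", v => "\x0B", "\\" => "\\", '"' => '"',
);

# Hypothetical stand-in decoder using numbered captures.
sub po_decode {
    my $s = substr(shift, 1, -1);   # trim surrounding double-quotes
    $s =~ s{ \\ (?: ([abfnrtv\\"]) | x([0-9A-Fa-f]+) | ([0-7]{1,3}) ) }
           { defined $1 ? $simple{$1}
           : defined $2 ? chr(0xFF & hex $2)
           :              chr(0xFF & oct $3) }xge;
    return $s;
}

# The msgstr value from file.po, kept raw by a single-quoted heredoc.
chomp(my $raw = <<'PO');
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"
PO

# What msgexec produced, written with unambiguous Perl escapes.
my $expected = "BEL \a BS \b FF \f LF \n CR \r HT \t VT? \x0B"
             . " BSOL \\ QUOT \" HEX? \t Y2 y2";

print po_decode($raw) eq $expected ? "match\n" : "mismatch\n";
```

The expected string uses \x0B rather than \v because Perl itself does not understand \v in double-quoted strings.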

Note: a special thanks to @ikegami and @briandfoy for their tips.
