I'm trying to interpolate the double-quoted strings as defined in the PO file format.
After some testing I could determine that the only recognized escape sequences are \a, \b, \f, \n, \r, \t, \v, \\, \", \xh..., \o, \oo, and \ooo (with h and o standing for hexadecimal and octal digits respectively).
Now suppose that you have the following file.po (the double-quoted string of the msgstr line is the one of interest):
msgid "test"
msgstr "BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"
I wrote a function to interpolate those strings, leveraging eval:
sub uncstring {
    local $_ = substr((shift), 1, -1); # trim surrounding double-quotes
    s/
        \\ ( [abfnrtv\\"] | x [[:xdigit:]]+ | [0-7]{1,3} )
    /
        eval "\"\\$1\""
    /xge;
    return $_;
}
When I use it with file.po:
perl -MB -lne '
BEGIN{ sub uncstring { ... } }
print B::cstring(uncstring($1)) if /^msgstr (".*")/;
' file.po
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? v BSOL \\ QUOT \" HEX? \001009 Y2 y2"
In comparison, when I use the GetText utility msgexec:
msgexec -i file.po 0 | perl -MB -lp0e '$_ = B::cstring($_);'
I get (after escaping the output with B::cstring):
"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \t Y2 y2"
As you can see, the two outputs differ for the escape sequences marked by VT? and HEX?.
How can I fix my uncstring function so that it interprets the escape sequences the way GetText does?
I'd do something like this rather than try to cram a bunch of stuff into a single substitution. The OP has posted a very nice solution since I originally answered, but here's something that turns my original example into something that works. This isn't particularly superior to the OP's solution, but I like showing off these features because they make sense in more complicated situations:
use open qw(:std :utf8);
use v5.10;

sub uncstring {
    my( $s ) = @_;
    $s =~ s/\A\h*"|"\h*\z//g;   # trim the surrounding double quotes

    # single-character escapes and their decoded values
    state $convert = {
        'a'  => "\a",
        'b'  => "\b",
        'f'  => "\f",
        'n'  => "\n",
        'r'  => "\r",
        't'  => "\t",
        'v'  => chr(hex('B')),
        '\\' => '\\',
        '"'  => '"',
    };

    my $r;
    local $_ = $s;
    # \x{5C} is the backslash; each branch resumes at pos() thanks to \G
    while( pos() < length ) {
        $r .= do {
            if(    / \G ([^\x{5C}]+)                  /gcx ) { $1 }
            elsif( / \G \x{5C} x ([0-9a-fA-F]+)       /gcx ) { chr(0xFF & hex($1)) }
            elsif( / \G \x{5C} ([0-7]{1,3})           /gcx ) { chr(0xFF & oct($1)) }
            elsif( / \G \x{5C} ([abfnrtv\x{5C}"])     /gcx ) { $convert->{$1} }
            else   { last; } # something is wrong if we are here
        }
    };

    $r;
}

my $msgstr = '"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\\\ QUOT \" HEX? \x01009 Y2 \1712"';
say uncstring($msgstr);
Mostly I like showing off the \G modifier with the /gc flag. In scalar context, on a successful match Perl remembers the end of the last match on a string (which you can get with pos) and you can anchor to that position with \G. The /gc in scalar context lets you try a pattern without resetting the pos.
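To see \G and /gc in isolation, here is a minimal sketch of a toy tokenizer. The tokenize function and its three token types are made up for illustration and are not part of the answer's code; the point is only that each failed branch leaves pos() where it was, so the next branch tries the same spot.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: split a string into [type, text] pairs.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    for ($input) {    # alias $_ to $input so pos() and length apply to it
        while ( ( pos() // 0 ) < length ) {
            # Scalar-context /g advances pos() on success;
            # /c keeps pos() unchanged when a branch fails.
            if    ( /\G(\d+)/gc )    { push @tokens, [ num   => $1 ] }
            elsif ( /\G([a-z]+)/gc ) { push @tokens, [ word  => $1 ] }
            elsif ( /\G(\s+)/gc )    { push @tokens, [ space => $1 ] }
            else                     { last }    # nothing matches at pos()
        }
    }
    return @tokens;
}

say join ' ', map { "$_->[0]($_->[1])" } tokenize("abc 123");
```

Without the /c flag, the first failed branch would reset pos() and the later branches would start again from the beginning of the string.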
One of the advantages here is that as things get complex, the smaller regexes and their ordering are still tractable. This is especially true when the transformations for each branch are very different.
I go into that in detail in Mastering Perl, 2nd Edition (not first edition).
This is the original answer, where I misread the example data and thought that it would always have the form presented with whitespace separating things. Because of that assumption, I went down a path that doesn't work in general. It's still here so you can see what the comments are responding to.
This breaks up the string then looks at each part individually. If it matches some pattern, it transforms the matched part however you like.
There are various ways to golf this, but I don't like going too far with that.
use open qw(:std :utf8);
use v5.10;

sub uncstring {
    my( $s ) = @_;
    $s =~ s/\A\h*"|"\h*\z//g;

    state $convert = {
        'a' => "\a",
        'b' => "\b",
        'f' => "\f",
        'n' => "\n",
        'r' => "\r",
        't' => "\t",
    };

    join '',
        map {
            if(    / \A [^\x{5C}]           /x ) { $_ }
            elsif( / \A \x{5C} x ([0-9a-f]+) /x ) { chr(hex($1)) }
            elsif( / \A \x{5C} ([0-7]+)      /x ) { chr(oct($1)) }
            elsif( / \A \x{5C} ([abfnrt])    /x ) { $convert->{$1} }
            elsif( / \A \x{5C} ([v])         /x ) { chr(hex('B')) }
            else   { $_ }
        }
        split /(\s+)/, $s; # () for separator retention mode
}

my $msgstr = '"BEL \a BS \b FF \f LF \n CR \r HT \t VT? \v BSOL \\ QUOT \" HEX? \x01009 Y2 \1712"';
say uncstring($msgstr);
There were two problems with using eval to decode the PO strings:
Perl doesn't know about \v (thanks @choroba), so eval converts it to v instead of a literal VT.
GetText reads \x escape sequences of any length, but it only keeps the least significant byte (like in standard C); e.g. \x01009 is equivalent to \x09 and is translated to a literal HT.
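Both differences can be checked directly in Perl. The snippet below is only a sketch; the variable names are illustrative, and the masking line mirrors the behaviour described above rather than calling any GetText code.

```perl
use strict;
use warnings;

# 1. Perl's double-quoted strings have no \v escape: the backslash is dropped.
#    Warnings are switched off locally because Perl flags the unknown escape.
my $perl_v = do { no warnings; "\v" };
printf "Perl \\v -> 0x%02X\n", ord $perl_v;    # the letter "v"
printf "VT      -> 0x%02X\n", ord "\x0B";      # vertical tab

# 2. Without braces, Perl's \x consumes at most two hex digits,
#    so this is chr(0x01) followed by the literal text "009".
my $perl_hex = "\x01009";

#    Keeping only the least significant byte, as described for GetText:
#    0x01009 & 0xFF == 0x09, a horizontal tab.
my $po_hex = chr( 0xFF & hex '01009' );

printf "Perl \\x01009 has length %d\n", length $perl_hex;
printf "masked \\x01009 -> 0x%02X\n", ord $po_hex;
```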
Here's the final code of a function that decodes PO strings; it uses a hash to store the translations of the single-char escape sequences, and uses the hex & oct functions to convert the hexadecimal and octal escape sequences into an integer; the obtained number is then stripped down to its least significant byte before being translated to a literal character.
use v5.14; # "use 5.10;" would be read as 5.100; /r and /a need 5.14
use feature qw{state switch};

sub po_unqqbackslash {

    # Associate the "alpha char" of a single-char escape sequence
    # to the "decoded value" of that escape sequence:
    state $decode = {
        "a"  => "\a",
        "b"  => "\b",
        "f"  => "\f",
        "n"  => "\n",
        "r"  => "\r",
        "t"  => "\t",
        "v"  => "\x0B",
        "\\" => "\\",
        "\"" => "\"",
    };

    # Run a global substitution on the string argument (trimmed
    # of its leading and trailing double-quotes).
    return substr(shift, 1, -1) =~ s{
        \\ (?:
            (?<chr> [abfnrtv\\"] )    | # single-char escape sequence
            x (?<hex> [[:xdigit:]]+ ) | # hexadecimal escape sequence
            (?<oct> [0-7]{1,3} )        # octal escape sequence
        )
    } [
        # NOTE: the captured group name is in fact the only key of %+
        my ($group) = %+;
        given($group) {
            when('chr') { $decode->{$+{$group}} }
            when('hex') { chr(0xFF & hex $+{$group}) }
            when('oct') { chr(0xFF & oct $+{$group}) }
        }
    ]xager;
}
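The given/when construct comes from the experimental switch feature, which warns on a number of Perl releases. As a hedged alternative, here is a sketch of the same decoding that picks the branch by testing which named capture is defined. The name po_decode is made up for this example; the regex and the masking are taken from the function above.

```perl
use v5.14;    # s///r and the /a modifier need Perl 5.14
use warnings;

# Hypothetical variant of po_unqqbackslash without given/when.
sub po_decode {
    my ($quoted) = @_;

    # Single-char escape sequences and their decoded values
    state $decode = {
        a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
        t => "\t", v => "\x0B", "\\" => "\\", '"' => '"',
    };

    # Trim the surrounding double quotes, then decode every escape.
    return substr( $quoted, 1, -1 ) =~ s{
        \\ (?:
            (?<chr> [abfnrtv\\"] )    |   # single-char escape sequence
            x (?<hex> [[:xdigit:]]+ ) |   # hexadecimal escape sequence
            (?<oct> [0-7]{1,3} )          # octal escape sequence
        )
    }{
          defined $+{chr} ? $decode->{ $+{chr} }
        : defined $+{hex} ? chr( 0xFF & hex $+{hex} )    # keep the low byte
        :                   chr( 0xFF & oct $+{oct} )
    }xager;
}

my $decoded = po_decode('"HT \t VT? \v HEX? \x01009 Y2 \1712"');
```

Only one of the three named groups can match for a given escape, so the chained conditional covers every case without looking at the keys of %+.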
Note: special thanks to @ikegami and @briandfoy for their tips.