Reputation: 40499
The following script is encoded in UTF-8:
use utf8;
$fuer = pack('H*', '66c3bc72');
$fuer =~ s/ü/!!!/;
print $fuer;
The ü in the s/// is stored in the script as c3 bc, as the following xxd hex dump shows.
0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72 use utf8;..$fuer
0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36 = pack('H*', '6
0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65 6c3bc72');..$fue
0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a r =~ s/../!!!/;.
0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a .print $fuer;.
c3 bc is the UTF-8 representation of ü.
Since the script is encoded in UTF-8 and I have use utf8 in effect, I expected the substitution to replace the ü of the für stored in the variable $fuer, yet it doesn't.
It does, however, if I remove the use utf8. This runs against what I thought use utf8 was for: to indicate that the script is encoded in UTF-8.
Upvotes: 6
Views: 1825
Reputation: 385897
Let's move the ü out of the s/// and into its own variable so we can inspect it.
use utf8; # Script is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8.
use strict;
use warnings;
my $uuml = "ü";
printf("%d %vX %s\n", length($uuml), $uuml, $uuml); # 1 FC ü
my $fuer = pack('H*', '66c3bc72');
printf("%d %vX %s\n", length($fuer), $fuer, $fuer); # 4 66.C3.BC.72 fÃ¼r (undecoded bytes, re-encoded on output)
$fuer =~ s/\Q$uuml/!!!/;
printf("%d %vX %s\n", length($fuer), $fuer, $fuer); # 4 66.C3.BC.72 fÃ¼r (unchanged: no match)
As this makes obvious, you are comparing the Unicode code point of ü (FC) against the UTF-8 encoding of ü (C3 BC).
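The mismatch can be reproduced without any non-ASCII source text. The following is a minimal sketch, not part of the original script; the variable names are made up for illustration, and encode_utf8 comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $char  = "\x{FC}";            # the character ü as a single code point
my $bytes = encode_utf8($char);  # its UTF-8 encoding, two bytes

printf "%d %vX\n", length($char),  $char;   # 1 FC
printf "%d %vX\n", length($bytes), $bytes;  # 2 C3.BC

# Neither a direct comparison nor a pattern match treats the two as equal.
my $same  = $char eq $bytes ? 1 : 0;
my $found = "f\xC3\xBCr" =~ /\Q$char/ ? 1 : 0;
print "equal: $same, pattern matches bytes: $found\n";  # equal: 0, pattern matches bytes: 0
```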
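So yes, use utf8; indicates that the script is encoded using UTF-8, but it does so in order that Perl can correctly decode the script. It has no effect on data the script builds or reads at run time, such as the bytes returned by pack.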
Decode all inputs and encode all outputs! The solution is to replace
my $fuer = pack('H*', '66c3bc72');
with
use Encode qw( decode_utf8 );
my $fuer = decode_utf8(pack('H*', '66c3bc72'));
or
my $fuer = pack('H*', '66c3bc72');
utf8::decode($fuer);
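Put together, the "decode inputs, encode outputs" rule might look like the sketch below. It assumes the packed bytes stand in for external input; $bytes, $patched and $out are illustrative names, and the pattern uses the escape \x{FC} so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 encode_utf8 );

my $bytes = pack('H*', '66c3bc72');       # UTF-8 encoded input: 4 bytes
my $fuer  = decode_utf8($bytes);          # decoded text: 3 characters

(my $patched = $fuer) =~ s/\x{FC}/!!!/;   # match the code point U+00FC
my $out = encode_utf8($patched);          # encode again before output

printf "%d %vX\n", length($fuer), $fuer;  # 3 66.FC.72
print "$out\n";                           # f!!!r
```

Using use open ':std', ':encoding(UTF-8)'; as in the script above would make the explicit encode_utf8 step unnecessary for text printed to STDOUT.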
Upvotes: 4
Reputation: 126722
The problem is with character boundaries. You are comparing an encoded string of bytes with a decoded character string. $fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", whereas a small u with diaeresis, ü, is the single character "\xfc", so the two don't match.
If you used decode_utf8 from the Encode module to further process your variable $fuer, it would decode the UTF-8 to form the three-character string "\x66\xfc\x72", and the substitution would then work.
use utf8 applies the equivalent of decode_utf8 to the whole source file, so without it your ü appears encoded as the two bytes "\xc3\xbc", which matches the packed variable.
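As a rough illustration of that difference, and assuming the file itself is saved as UTF-8, the sketch below assigns the same literal with the pragma switched on and off in two lexical scopes. The variable names $with and $without are invented for the example.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;          # source is decoded: the literal is one character
    $with = "ü";
}
{
    no utf8;           # source is taken as raw bytes: the literal is two bytes
    $without = "ü";
}

printf "%d %vX\n", length($with),    $with;     # 1 FC
printf "%d %vX\n", length($without), $without;  # 2 C3.BC
```

Only the second form matches the undecoded output of pack('H*', '66c3bc72'), which is why removing use utf8 made the original substitution succeed.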
Upvotes: 9