Why is umlaut not recognized in a UTF-8-encoded Perl script with "use utf8"?

Question

The following script is encoded in UTF-8:

use utf8;

$fuer = pack('H*', '66c3bc72');

$fuer =~ s/ü/!!!/;

print $fuer;

The ü in the s/// is stored in the script as c3 bc, as the following xxd hex dump shows.

0000000: 75 73 65 20 75 74 66 38 3b 0a 0a 24 66 75 65 72  use utf8;..$fuer
0000010: 20 3d 20 70 61 63 6b 28 27 48 2a 27 2c 20 27 36   = pack('H*', '6
0000020: 36 63 33 62 63 37 32 27 29 3b 0a 0a 24 66 75 65  6c3bc72');..$fue
0000030: 72 20 3d 7e 20 73 2f c3 bc 2f 21 21 21 2f 3b 0a  r =~ s/../!!!/;.
0000040: 0a 70 72 69 6e 74 20 24 66 75 65 72 3b 0a        .print $fuer;.

c3 bc is the UTF-8 representation for ü.

Since the script is encoded in UTF-8 and I have use utf8 in effect, I expected the script to replace the ü in the variable $fuer, yet it doesn't.

It does, however, if I remove the use utf8. This runs against what I thought use utf8 was for: to indicate that the script is encoded in UTF-8.

Borodin · Accepted Answer

The problem is with character boundaries: you are comparing an encoded string of bytes with a decoded string of characters.

$fuer = pack('H*', '66c3bc72') creates the four-byte string "\x66\xc3\xbc\x72", whereas under use utf8 a small u with diaeresis (ü) is the single character "\xfc", so the two don't match.
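To make the mismatch visible, here is a minimal sketch that dumps both strings element by element. It assumes the script file is saved as UTF-8; the variable names and the hex_dump helper are made up for illustration and are not part of the question.

```perl
use strict;
use warnings;
use utf8;   # the ü literal below is decoded to the single character U+00FC

# Hypothetical helper: hex value of every element (byte or character) in a string
sub hex_dump { join ' ', map { sprintf '%02x', ord } split //, shift }

my $packed  = pack('H*', '66c3bc72');   # four raw bytes, never decoded
my $literal = 'ü';                      # one character because of use utf8

printf "packed:  length %d: %s\n", length($packed),  hex_dump($packed);
printf "literal: length %d: %s\n", length($literal), hex_dump($literal);
# prints:
# packed:  length 4: 66 c3 bc 72
# literal: length 1: fc
```

The packed string contains no element with the value fc, which is why the substitution finds nothing to replace.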

If you passed $fuer through decode_utf8 from the Encode module, it would decode the UTF-8 bytes into the three-character string "\x66\xfc\x72", and the substitution would then work.
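A minimal sketch of that fix, assuming the script file is saved as UTF-8. The names $bytes and $replaced are illustrative, and the output layer on STDOUT is an addition of this sketch so that non-ASCII characters would be encoded again when printed.

```perl
use strict;
use warnings;
use utf8;                        # source literals are decoded as UTF-8
use Encode qw(decode_utf8);

# Four raw bytes: 66 c3 bc 72 (the UTF-8 encoding of "für")
my $bytes = pack('H*', '66c3bc72');

# Decode the bytes into a three-character string: f, U+00FC, r
my $fuer = decode_utf8($bytes);

# Both sides are now character strings, so the pattern matches
(my $replaced = $fuer) =~ s/ü/!!!/;

# Encode characters back to UTF-8 on output
binmode(STDOUT, ':encoding(UTF-8)');
print "$replaced\n";             # prints: f!!!r
```

The general pattern is to decode input as early as possible, work on characters, and encode only on output.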

use utf8 applies the equivalent of decode_utf8 to the whole source file, so without it your ü stays encoded as the two bytes "\xc3\xbc", which match the packed variable.
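Because use utf8 is lexically scoped, both behaviours can be shown side by side in one file. This is only a sketch, assuming the file is saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;

my $packed = pack('H*', '66c3bc72');   # raw bytes 66 c3 bc 72
my ($bytes_mode, $chars_mode) = ($packed, $packed);

{
    # No "use utf8" in this scope: the pattern is the two bytes c3 bc
    $bytes_mode =~ s/ü/!!!/;
}

{
    use utf8;   # in this scope the pattern is the single character U+00FC
    $chars_mode =~ s/ü/!!!/;
}

print "without use utf8: $bytes_mode\n";                      # prints: without use utf8: f!!!r
print "with use utf8:    ", unpack('H*', $chars_mode), "\n";  # prints: with use utf8:    66c3bc72
```

In the first block bytes are compared with bytes, so the substitution succeeds. In the second block the byte string is left untouched, matching the behaviour described in the question.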
