Reputation: 40778
I am trying to understand how Perl handles Unicode.
use feature qw(say);
use strict;
use warnings;
use Encode qw(encode);
say unpack "H*", pack("U", 0xff);
say unpack "H*", encode( 'UTF-8', chr 0xff );
Output:
ff
c3bf
Why do I get ff and not c3bf when using pack?
Upvotes: 2
Views: 226
Reputation: 30235
Why do I get ff and not c3bf when using pack?
This is because pack creates a character string, not a byte string.
> perl -MDevel::Peek -e 'Dump(pack("U", 0xff));'
SV = PV(0x13a6d18) at 0x13d2ce8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0xa6d298 "\303\277"\0 [UTF8 "\x{ff}"]
CUR = 2
LEN = 32
Hence unpack("H*") doesn't look at the internal bytes of that string, but at its (truncated) character values. If you instead do:
say unpack "H*", encode("UTF-8", pack("U", 0xff));
Then you'd get the expected result.
See also this thread.
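To make the character-string versus byte-string distinction concrete, here is a minimal sketch that compares the two strings side by side. The helper name hex_of is made up for illustration and is not part of any module; utf8::is_utf8 only reports the internal storage flag that Devel::Peek shows above.

```perl
use strict;
use warnings;
use feature qw(say);
use Encode qw(encode);

# hex_of() is a hypothetical helper name used only in this sketch.
sub hex_of { return unpack "H*", $_[0] }

my $chars = pack("U", 0xff);          # character string: one character, U+00FF
my $bytes = encode("UTF-8", $chars);  # byte string: the UTF-8 encoding of it

say length $chars;     # 1 (one character)
say length $bytes;     # 2 (two bytes)
say hex_of($chars);    # ff   (the character value, not the internal storage)
say hex_of($bytes);    # c3bf (the encoded bytes)

# The internal UTF8 flag is set on the packed string, matching the Dump output.
say utf8::is_utf8($chars) ? "UTF8 flag on" : "UTF8 flag off";
```

The point of the sketch is that the internal "\303\277" storage never shows up unless you explicitly encode the string.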
Upvotes: 2
Reputation: 386501
pack('U', 0xFF) is just a weird way of doing chr(0xFF)
so
"\xFF"                            returns chars FF
chr(0xFF)                         returns chars FF
pack('U', 0xFF)                   returns chars FF
"\xC3\xBF"                        returns chars C3 BF
encode('UTF-8', chr(0xFF))        returns chars C3 BF
encode('UTF-8', pack('U', 0xFF))  returns chars C3 BF
so
say unpack "H*", "\xFF";                            # outputs ff
say unpack "H*", chr(0xFF);                         # outputs ff
say unpack "H*", pack('U', 0xFF);                   # outputs ff
say unpack "H*", "\xC3\xBF";                        # outputs c3bf
say unpack "H*", encode('UTF-8', pack('U', 0xFF));  # outputs c3bf
say unpack "H*", encode('UTF-8', chr(0xFF));        # outputs c3bf
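If you want to check the equivalences above yourself, a small sketch along these lines should do it. The helper name chars_of is an assumption made for this example; it just lists the ordinal of each character in the string in hex.

```perl
use strict;
use warnings;
use feature qw(say);
use Encode qw(encode);

# chars_of() is a hypothetical helper: the characters of a string as hex ordinals.
sub chars_of {
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0];
}

my $literal   = "\xFF";
my $from_chr  = chr(0xFF);
my $from_pack = pack('U', 0xFF);

# All three are the same one-character string, however Perl stores them.
say $literal  eq $from_chr  ? "equal" : "different";   # equal
say $from_chr eq $from_pack ? "equal" : "different";   # equal

say chars_of($from_pack);                    # FF
say chars_of(encode('UTF-8', $from_pack));   # C3 BF
```

String equality compares characters, so the differently stored strings still compare equal, which is why unpack "H*" gives the same answer for all three.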
Upvotes: 2