sid_com

Reputation: 25107

Question about the "utf-8"-behavior

#!/usr/bin/env perl
use warnings;
use 5.012;
use Encode qw(encode);

no warnings qw(utf8);

my $c = "\x{ffff}";

my $utf_8 = encode( 'utf-8', $c );
my $utf8 = encode( 'utf8', $c );

say "utf-8 :  @{[ unpack '(B8)*', $utf_8 ]}";
say "utf8  :  @{[ unpack '(B8)*', $utf8 ]}";

# utf-8 :  11101111 10111111 10111101
# utf8  :  11101111 10111111 10111111

Does "utf-8" encode this way in order to automatically replace my code point with the last interchangeable code point (of the first plane)?

Upvotes: 3

Views: 477

Answers (1)

cjm

Reputation: 62089

See the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

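The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only code points that the Unicode standard permits for interchange: U+0000 through U+10FFFF, excluding surrogates and, as the output in the question shows, non-characters such as U+FFFF.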
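To see which of the two encodings a given name resolves to, you can ask Encode directly. This is a minimal sketch using find_encoding, which the Encode docs describe; the @names variable is just for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Encoding names are matched case-insensitively, so "utf-8" and "UTF-8"
# resolve to the same (strict) encoding object, while "utf8" is the lax one.
my @names = map { find_encoding($_)->name } qw(utf-8 UTF-8 utf8);

print "$_\n" for @names;
```

According to the Encode docs, the first two names should report utf-8-strict and the last one utf8.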

\x{FFFF} is a non-character: Unicode reserves U+FFFF for internal use and does not intend it for open interchange. But Perl's utf8 encoding doesn't care about that.

By default, the encode function replaces any character that does not exist in the destination charset with a substitution character (see the Handling Malformed Data section). For utf-8, that substitution character is U+FFFD (REPLACEMENT CHARACTER), which is encoded in UTF-8 as 11101111 10111111 10111101 (binary).
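If you would rather get an error than a silent substitution, you can pass a CHECK argument to encode. The sketch below uses a surrogate (U+D800) instead of U+FFFF, on the assumption that surrogates are rejected by the strict encoding regardless of the Encode version; the variable names are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

no warnings qw(utf8);    # silence warnings about the surrogate below

# U+D800 is a surrogate, which strict UTF-8 does not accept.
my $bad = "\x{D800}";

# Default behaviour (no CHECK): the character is replaced by U+FFFD.
my $replaced = encode( 'UTF-8', $bad );
print 'default : ', unpack( 'H*', $replaced ), "\n";

# The lax utf8 encoding encodes the code point as-is.
my $lax = encode( 'utf8', $bad );
print 'lax     : ', unpack( 'H*', $lax ), "\n";

# With FB_CROAK, encode dies instead of substituting.
# LEAVE_SRC keeps encode from modifying $bad in place.
my $ok = eval {
    encode( 'UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    1;
};
print 'croak   : ', ( $ok ? 'encoded' : 'died' ), "\n";
```

The default line should show the bytes of U+FFFD (ef bf bd in hex), matching the binary output in the question, and the FB_CROAK call should die rather than return a string.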

Upvotes: 7
