Eugene Barsky

Reputation: 6002

Perl 6 transforms combined graphemes?

How is it possible that two codepoints are transformed into one? And if this is the default way of treating combined graphemes, how can I avoid it?

> my $a = "a" ~ 0x304.chr
ā
> $a.codes
1
> $a.ords
(257)

UPD: Upon reading the documentation, I see that all input is already normalized:

Perl6 applies normalization by default to all input and output except for file names which are stored as UTF8-C8.

So, is there a way to avoid the normalization, i.e. to read the input and work with it exactly as it was encoded?

Upvotes: 2

Views: 131

Answers (1)

callyalater

Reputation: 3102

According to a Unicode report (see here), some characters have multiple ways of being represented. Per that report:

Certain characters are known as singletons. They never remain in the text after normalization. Examples include the angstrom and ohm symbols, which map to their normal letter counterparts a-with-ring and omega, respectively.

...

Many characters are known as canonical composites, or precomposed characters. In the D forms, they are decomposed; in the C forms, they are usually precomposed.

In the example you provided, $a contains a string that can be represented in two ways. First, it can be the single precomposed codepoint U+0101 (LATIN SMALL LETTER A WITH MACRON). Second, it can be two codepoints that combine to form an equivalent character: U+0061 (LATIN SMALL LETTER A) followed by U+0304 (COMBINING MACRON).

These two representations are the basis for NFC and NFD. They are called normalization forms because they give every character one regular representation: NFC uses the most compact (composed) form available, and NFD uses the most decomposed one. Some characters have two entries in the Unicode table (such as OHM SIGN, U+2126, and GREEK CAPITAL LETTER OMEGA, U+03A9), but normalization maps both to a single entry.
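A minimal sketch of the singleton case, assuming a Rakudo build that follows the normalization rules quoted above. The variable names are only for illustration. Because the string is normalized when it is created, the ohm sign should already be stored as the Greek capital omega:

```raku
# U+2126 OHM SIGN is a singleton; its normalized form is U+03A9.
my $ohm   = "\x[2126]";
my $omega = "\x[3A9]";      # GREEK CAPITAL LETTER OMEGA

say $ohm.ords;              # (937), i.e. 0x3A9, not 0x2126
say $ohm eq $omega;         # True: both strings hold the same codepoint
```

The original U+2126 does not survive in the Str, which is what "never remain in the text after normalization" means in practice.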

NFD decomposes every character into the full sequence of codepoints that make it up and never uses the precomposed character.

Perl 6 automatically stores strings in the composed (NFC) representation, but you can get the NFD (decomposed) version by calling the NFD conversion method on a Str.

my $a = "a" ~ 0x304.chr;

say $a.codes;                 # OUTPUT: 1
                              # A Str is kept in composed (NFC) form,
                              # so the two codepoints become one.
say $a.ords;                  # OUTPUT: (257)

say $a.NFD.codes;             # OUTPUT: 2
say $a.NFD.list;              # OUTPUT: (97 772)

NFC and NFD are both useful, but are intended for distinct purposes. As far as I can tell, there is no way to avoid normalization on input, but you can convert the input to whichever representation you need using the NFC and NFD conversion methods.
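As a hedged sketch of converting between the two forms: the Uni class (not mentioned above, so treat its use here as my assumption about what fits your case) holds a plain list of codepoints, and the NFC and NFD methods convert from there. This works on codepoints you already have; it does not change how input is read.

```raku
# Build the decomposed sequence directly from codepoints.
my $raw = Uni.new(0x61, 0x304);   # a + COMBINING MACRON

say $raw.list;                    # (97 772)
say $raw.NFC.list;                # (257)     composed form
say $raw.NFC.NFD.list;            # (97 772)  decomposed again
```

Going from NFC back to NFD recovers the same two codepoints for this character, so either form can be produced on demand.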

Upvotes: 4
