Chas. Owens

Reputation: 64939

How can I make Perl 6 be round-trip safe for Unicode data?

A naïve Perl 6 program is not round-trip safe with respect to Unicode. It appears to use Normalization Form C (NFC) internally for the Str type:

$ perl -CO -E 'say "e\x{301}"' | perl6 -ne '.say' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+00e9
U+000a
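
The composition in the output above is what NFC does to this particular sequence, regardless of which language applies it. As a minimal sketch, here is the same transformation in Python (chosen only for illustration; it is not part of the original pipeline), using the standard unicodedata module. The helper name code_points is made up for this example.

```python
import unicodedata

# The same input the Perl 5 one-liner emits: "e", COMBINING ACUTE ACCENT, newline.
original = "e\u0301\n"

# NFC composes the base letter and the combining mark into one code point.
composed = unicodedata.normalize("NFC", original)

def code_points(s):
    """Format each character as U+xxxx, like the printf in the pipeline."""
    return ["U+%04x" % ord(ch) for ch in s]

print(code_points(original))   # ['U+0065', 'U+0301', 'U+000a']
print(code_points(composed))   # ['U+00e9', 'U+000a']

# The UTF-8 bytes differ, so input and output are no longer byte-identical.
print(original.encode("utf-8") == composed.encode("utf-8"))  # False
```

The two strings are canonically equivalent, but any tool that compares or hashes the bytes will see them as different.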

Poking through the docs, I can't find anything about this behavior, and I find it very shocking. I can't believe you have to drop back to the byte level to round-trip text:

$ perl -CO -E 'say "e\x{301}"' | perl6 -e 'while (my $byte = $*IN.read(1)) { $*OUT.write($byte) }' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+0065
U+0301
U+000a

Do all text files have to be in NFC to be safely round-tripped with Perl 6? What if the document is supposed to be in NFD? I must be missing something here. I cannot believe this is intentional behavior.

Upvotes: 8

Views: 204

Answers (2)

Christopher Bottoms

Reputation: 11193

Use UTF8-C8. From the documentation:

You can use UTF8-C8 with any file handle to read the exact bytes as they are on disk. They may look funny when printed out, if you print it out using a UTF8 handle. If you print it out to a handle where the output is UTF8-C8, then it will render as you would normally expect, and be a byte for byte exact copy.

Upvotes: 3

Chas. Owens

Reputation: 64939

The answer seems to be to use the Uni type (the base class for NFD, NFC, etc.), but it doesn't really work that way yet, and there is no good way to get a file into a Uni string. So, until some unspecified point in the future, you cannot round-trip a non-normalized file unless you treat it as bytes.
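
To illustrate why re-normalizing on output is not a general fix, here is a small sketch in Python (used only for illustration, not Perl 6) with the standard unicodedata module. The sample content is hypothetical: a file that mixes a precomposed é with a decomposed one, so it is in neither NFC nor NFD.

```python
import unicodedata

# Hypothetical file content mixing both forms: a precomposed é (U+00E9)
# followed by a decomposed one (e + U+0301).
data = "\u00e9 e\u0301\n".encode("utf-8")

text = data.decode("utf-8")
as_nfc = unicodedata.normalize("NFC", text).encode("utf-8")
as_nfd = unicodedata.normalize("NFD", text).encode("utf-8")

# Neither normalization form reproduces the original bytes.
print(as_nfc == data)  # False
print(as_nfd == data)  # False

# Copying the bytes without ever decoding them preserves the content exactly.
copied = bytes(data)
print(copied == data)  # True
```

Once the text has been normalized, the information about which form each character originally used is gone, which is why only the byte-level copy round-trips here.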

Upvotes: 6
