Reputation: 25117
Why does this print a U and not a Ü?
#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);
my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";
while ( $string =~ /(\X)/g ) {
say $1;
}
# Output: U
Upvotes: 8
Views: 562
Reputation: 80384
Your code is correct.
You really do need to play these things by the numbers; don’t trust what a "terminal" displays. Pipe it through the uniquote program, probably with -x or -v, and see what it is really doing.
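Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you. Normalization shouldn’t matter: the output below renders the same whether the string is composed or decomposed, and only the escaped code points differ.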
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
cr\x{E8}me br\x{FB}l\x{E9}e
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"'
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
cre\x{300}me bru\x{302}le\x{301}e
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"'
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
\x{E9}el\x{302}urb em\x{300}erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
e\x{301}el\x{302}urb em\x{300}erc
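If uniquote is not installed, the same check can be approximated in plain Perl. This is only a minimal sketch: escape_non_ascii is a hypothetical helper name, not part of uniquote, and it imitates just the \x{...} style of output shown above.

```perl
use strict;
use warnings;
use charnames qw(:full);

# Hypothetical helper (not part of uniquote): replace every non-ASCII
# code point with \x{HEX}, so the terminal cannot hide what the string holds.
sub escape_non_ascii {
    my ($text) = @_;
    (my $escaped = $text) =~ s/([^\x00-\x7F])/sprintf '\\x{%X}', ord $1/ge;
    return $escaped;
}

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";
print escape_non_ascii($string), "\n";    # prints U\x{308}
```

Because the output is pure ASCII, it reads the same on any terminal, buggy or not.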
Upvotes: 8
Reputation: 967
1) Apparently, your terminal can't display extended characters. On my terminal, it prints:
U¨
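2) \X doesn't do what you think it does. It matches an extended grapheme cluster, that is, a base character together with any combining marks that follow it; it does not compose them into a single code point. If you use the string "fu\N{COMBINING DIAERESIS}r", your program displays: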
f
u¨
r
Note how the diacritic mark isn't printed alone but with its corresponding character.
3) To compose a base character and its combining marks into a single precomposed character (where one exists), normalize the string to NFC with the module Unicode::Normalize:
use Unicode::Normalize;
my $string = "fu\N{COMBINING DIAERESIS}r";
$string = NFC($string);
while ( $string =~ /(\X)/g ) {
say $1;
}
It displays:
f
ü
r
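The difference between the two forms can be checked numerically rather than by eye. The following is a rough sketch, with cluster_count as a made-up helper name: it compares the number of code points (length) with the number of \X matches before and after NFC.

```perl
use strict;
use warnings;
use charnames qw(:full);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: count extended grapheme clusters via \X matches.
sub cluster_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

my $decomposed = "fu\N{COMBINING DIAERESIS}r";
my $composed   = NFC($decomposed);

# The code point count drops after NFC; the cluster count does not change.
printf "decomposed: %d code points, %d clusters\n",
    length($decomposed), cluster_count($decomposed);
printf "composed:   %d code points, %d clusters\n",
    length($composed), cluster_count($composed);
```

The cluster count is 3 in both cases, so \X already treats u plus the diaeresis as one unit; NFC only changes how many code points sit inside that unit.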
Upvotes: 1
Reputation: 106385
May I suggest it's the output which is incorrect? It's easy to check: replace your loop code with:
my $counter;
while ( $string =~ /(\X)/g ) {
say ++$counter, ': ', $1;
}
... and see how many times the regex matches. My guess is that it will still match only once.
Alternatively, you can use this code:
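sub codepoint_hex {
    # $1 is already a character string here, so no decoding step is needed;
    # list every code point in the cluster, not only the first.
    return join ' ', map { sprintf "%04x", ord $_ } split //, shift;
}
... and then print codepoint_hex($1) instead of plain $1 within the while loop.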
Upvotes: 1
Reputation: 20270
This works for me, though I have an older version of Perl, 5.12, on Ubuntu. My only change to your script is: use 5.012;
$ perl so.pl
Ü
Upvotes: 3