Reputation: 64919
I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?
Upvotes: 13
Views: 1119
Reputation: 12842
Perl6::Str->reverse also works.
In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.
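A minimal sketch of the normalize-then-reverse idea, and of where it stops working. The helper name reverse_nfc is made up for illustration; NFC is the function exported by Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compose the string first, then reverse it
# code point by code point.
sub reverse_nfc {
    my ($str) = @_;
    return scalar reverse NFC($str);
}

# "e" + U+0301 composes to U+00E9, so each accent stays with its base.
my $ok = reverse_nfc("re\x{0301}sume\x{0301}");   # "\x{E9}mus\x{E9}r"

# "q" + U+0301 has no precomposed form, so the mark is still detached
# from "q" and ends up on the "a" instead.
my $bad = reverse_nfc("q\x{0301}a");              # "a\x{0301}q"
```

The second call is the failure case described above: with no single codepoint to compose into, the plain reverse sees the combining mark as a separate character again.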
Upvotes: 1
Reputation: 132802
The best answer is to use Unicode::GCString, as Sinan points out.
I modified Chas's example a bit: I set the encoding on STDOUT with binmode, and I made a change to the split (it doesn't work after 5.10, apparently, so I removed it). It's basically the same thing with a couple of tweaks.
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE
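Because split /(\X)/ treats each grapheme as a captured separator, the list it returns also contains empty strings between the graphemes. They vanish in the join, but if you want a clean list, one alternative (a sketch, not the code from this answer) is to match \X in list context. The sub name reverse_graphemes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse a string one extended grapheme
# cluster at a time.
sub reverse_graphemes {
    my ($str) = @_;
    # In list context, /\X/g returns every cluster it matches,
    # with no empty fields in between.
    return join '', reverse $str =~ /\X/g;
}

binmode STDOUT, ":utf8";
print reverse_graphemes("re\x{0301}sume\x{0301}"), "\n";
```

This should give the same result as the split version for this string; it just skips the empty fields.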
Upvotes: 8
Reputation: 118128
You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér
Upvotes: 2
Reputation: 11
Some of the other answers contain elements that don't work well. Here is a working example, tested on Perl 5.12 and 5.14. Failing to specify the binmode will cause the output to generate "Wide character in print" warnings. Using a positive lookahead assertion (and no separator retention mode) in split produces incorrect output on my MacBook.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
Upvotes: 0
Reputation: 64919
You can use the \X special escape (match a non-combining character and all of the following combining characters) with split to make a list of graphemes (with empty strings between them), reverse the list of graphemes, then join them back together:
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
Upvotes: 12