Amir E. Aharoni

Reputation: 1318

How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:

my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);

I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected the code above to return "חום", but it returns "חם".

I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.

I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.

Thanks!

Upvotes: 8

Views: 230

Answers (2)

Christoph

Reputation: 169573

My regex-fu is weak, so I'd go with a less magical solution.

First, you can remove all marks via samemark:

'חוּם'.samemark('a')

Second, you can decompose the graphemes via .NFD and operate on individual codepoints (e.g. keeping only values with the property Grapheme_Base) and then recompose the string:

Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
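
The same decompose-and-filter idea can be sketched with a different property. The snippet below is only an illustration, not a definitive recipe: it assumes that keeping codepoints whose General_Category starts with "L" (letters) is what is wanted, and the names $hum, @letters and $stripped are made up for the example. It also prints the grapheme and codepoint counts to show why the grapheme-level substitution removed the whole וּ.

```raku
my $hum = 'חוּם';

# .chars counts graphemes, .codes counts codepoints.
# The dagesh is its own codepoint but shares a grapheme with the vav.
say $hum.chars;   # 3
say $hum.codes;   # 4

# Decompose to codepoints, then keep only those classified as letters (L*).
my @letters = $hum.NFD.grep({ .uniprop('General_Category').starts-with('L') });

# Turn the remaining codepoints back into a string.
my $stripped = @letters.map(*.chr).join;
say $stripped;
```

Filtering on a property rather than on a fixed range also keeps letters from other scripts, which may or may not be desirable depending on the input.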

In case of mixed strings, stripping marks from Hebrew characters only could look like this:

$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));

Upvotes: 9

Håkon Hægland

Reputation: 40748

Here is a simple approach:

my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max; 
}
say join('', @ords.map: { .chr });

Output:

חום
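
As a rough sketch, the loop above can also be written as a single grep over the codepoints. The names $range and $kept are introduced here for illustration only. One caveat to keep in mind: .ords works on the composed (NFC) form of the string, so for scripts that have precomposed characters, going through .NFD first may be the safer choice.

```raku
my $hum = 'חוּם';

# Inclusive codepoint range from ALEF to TAV.
my $range = "\c[HEBREW LETTER ALEF]".ord .. "\c[HEBREW LETTER TAV]".ord;

# Keep only the codepoints inside the range and join them back into a string.
my $kept = $hum.ords.grep({ $_ ~~ $range }).map(*.chr).join;
say $kept;
```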

Upvotes: 3
