Reputation: 6958
I'm currently somewhat stuck getting a regular expression in Perl (taken from an earlier question of mine) to match word characters from a non-ASCII locale (i.e., German umlauts).
I already tried various things, such as setting the correct locale (using setlocale) and decoding the data that I receive from MySQL as UTF-8 (using decode_utf8). Unfortunately, to no avail. Google also did not help much.
Is there a way to make the following regex locale-aware, so that
$street = "Täststraße"; # I know that this is not orthographically correct
$street =~ s{
\b (\w{0,3}) (\w*) \b
}
{
$1 . ( '*' x length $2 )
}gex;
ends up returning $street = "Täs*******" instead of "Tästs***ße"?
Upvotes: 4
Views: 1395
Reputation: 42674
I would expect the regex to result in "Täs*******". And this is what I get when I "use utf8" in a UTF-8 encoded file with your code above.
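For reference, a minimal sketch of the setup I mean, assuming the script file itself is saved as UTF-8. The helper name mask_words is hypothetical and only wraps the substitution from the question; the binmode line is my addition so that the output is encoded for a UTF-8 terminal.

```perl
use strict;
use warnings;
use utf8;                            # string literals in this file are UTF-8 encoded
binmode STDOUT, ':encoding(UTF-8)';  # assumption: the terminal expects UTF-8 output

# Hypothetical helper: keep up to three leading word characters of each
# word and replace the rest with asterisks (the regex from the question).
sub mask_words {
    my ($text) = @_;
    $text =~ s{
        \b (\w{0,3}) (\w*) \b
    }
    {
        $1 . ( '*' x length $2 )
    }gex;
    return $text;
}

print mask_words("Täststraße"), "\n";
```

With the literal decoded by "use utf8", ä and ß count as word characters, so the whole street name is treated as a single word.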
(If everything is Latin-1, that changes the behavior of the regex engine. Hence the existence of utf8::upgrade. See Unicode::Semantics.)
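A small sketch of that difference, assuming no "use locale", no unicode_strings feature, and no /u flag on the regex. The helper name mask_words is hypothetical and just wraps the substitution from the question; the string is written with \x escapes so that it starts out in Perl's byte-wise internal form.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the question.
sub mask_words {
    my ($text) = @_;
    $text =~ s{
        \b (\w{0,3}) (\w*) \b
    }
    {
        $1 . ( '*' x length $2 )
    }gex;
    return $text;
}

# "Täststraße" as Latin-1 code points; stored internally as plain bytes.
my $bytes = "T\xE4ststra\xDFe";

# Same characters, but switched to the internal UTF-8 representation.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

my $from_bytes    = mask_words($bytes);     # \xE4 and \xDF do not match \w here
my $from_upgraded = mask_words($upgraded);  # \xE4 and \xDF match \w here

binmode STDOUT, ':encoding(UTF-8)';  # assumption: the terminal expects UTF-8 output
print "$from_bytes\n";
print "$from_upgraded\n";
```

The two strings compare equal with eq, yet the first call should give the "Tästs***ße" result from the question and the second the expected "Täs*******".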
Edit: I see you fixed your post and that we agree on the expected result. Basically, use Unicode::Semantics when you want Unicode semantics on your regexps.
Upvotes: 6