Reputation: 6837
I am trying to write some code to take UTF-8 text and create a slug that contains some UTF-8 characters. So this is not about transliterating UTF-8 into ASCII.
So basically I want to replace any UTF-8 character that is whitespace, a control character, punctuation, or a symbol with a dash. There exist Unicode categories that I should be able to use: \p{Z}
, \p{C}
, \p{P}
, or \p{S}
, respectively.
So I could do something as simple as this:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#", "-", "This. test? has an ö in it");
but it results in this:
This-test-has-an-√-in-it
(and I'd want This-test-has-an-ö-in-it
)
It butchers the umlaut o, possibly because in Unicode it is comprised of two bytes c3b6
of which the b6
seems to be recognized as a punctuation character (so the \p{P}
matches here). The c3
remains in the text. This is strange because AFAIK a single byte b6
doesn't exist in UTF-8.
I also tried the same thing in Perl in order to make sure it is not a PHP problem, but the code
$s = 'This. test? has an ö in it';
$s =~ s/(\p{P}|\p{C}|\p{S}|\p{Z})+/-/g;
also produces:
This-test-has-an-√-in-it
(which probably makes sense as PHP's PCRE are Perl Compatible Regular Expressions)
While when I do this in Python
import regex as re
text=u"This. test? has an ö in it";
print re.sub(ur"(\p{P}|\p{C}|\p{S}|\p{Z})+", "-", text)
it produces my desired
This-test-has-an-ö-in-it
What to do?
Upvotes: 1
Views: 2326
Reputation: 6837
The solution was to use the "Unicode modifier" u
:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#u", "-", "This. test? has an ö in it");
correctly produces
This-test-has-an-ö-in-it
So: using Unicode Categories without the Unicode modifier produces strange results without any warning.
Upvotes: 2