Reputation: 4731
I know that if I use multibyte(UTF-8) characters for the pattern, I have to use mb_
functions or have to use u
option for pattern of preg_
functions.
But when I use multibyte(UTF-8) characters only for the subject of preg_
functions and use only ascii characters for the pattern, do preg_
functions (without u
option) work correctly?
I know that in this case I have to use mb_
function or add u
option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know if this code(u
option is not used) is safe or not:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!@`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
Upvotes: 1
Views: 118
Reputation: 4731
Maybe I found the answer by myself.
But someone who knows about character code well, please comment to this answer or post another answer.
According to wikipedia, UTF-8 character codes don't contain ascii code.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means preg function with ascii pattern without u option is safe for multibyte(UTF8) subject.
And this code (without u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
are the same. Both correctly works.
Am I correct?
Upvotes: 1
Reputation: 16968
It is safe as far as I know as long as you use the unicode property (/u
) like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!@`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
Upvotes: 0