js_

Reputation: 4731

Is it safe to use preg_ functions with an ASCII-only pattern and a UTF-8 multibyte subject?

I know that if the pattern contains multibyte (UTF-8) characters, I have to use the mb_ functions or add the u modifier to the pattern of the preg_ functions.

But if multibyte (UTF-8) characters appear only in the subject and the pattern contains only ASCII characters, do the preg_ functions work correctly without the u modifier?

I know that in this case I have to use an mb_ function or add the u modifier to the pattern:

$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);

I want to know whether this code (without the u modifier) is safe:

$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!@`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);

Upvotes: 1

Views: 118

Answers (2)

js_

Reputation: 4731

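I may have found the answer myself, but if anyone knows character encodings well, please comment on this answer or post another one.

According to Wikipedia, the bytes that make up a multibyte UTF-8 character never coincide with ASCII bytes: ASCII characters use 0x00-0x7F, while every byte of a multibyte sequence is 0x80 or above.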

http://en.wikipedia.org/wiki/UTF-8#Advantages

The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.

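I think this means that a preg_ function with an ASCII-only pattern and no u modifier is safe for a multibyte (UTF-8) subject: an ASCII byte in the pattern can only match a real ASCII character in the subject, never part of a multibyte character.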

And this code (without u option)

$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);

and this code (with u option)

$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);

give the same result for a valid UTF-8 subject. Both work correctly.

Am I correct?

Upvotes: 1

Ben Carey

Reputation: 16968

As far as I know, it is safe as long as you use the u modifier (/u), like so:

$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!@`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);

For more information on Unicode characters, see here

Upvotes: 0
