Learnerer

Reputation: 583

How is UTF-8 safe relative to ASCII chars?

I was reading on Wikipedia, and came across the following:

"Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, 
UTF-8 is safe to use within most programming and document languages that 
interpret certain ASCII characters in a special way, such as "/" in filenames, 
"\" in escape sequences, and "%" in printf."

What I can't understand is how this would be a problem even if it did happen. If the application processing the bytes supports UTF-8, there is no problem, since it knows to interpret each byte in the context of the bytes before and after it. And if it doesn't support UTF-8, then it has no business dealing with the data in the first place, and coming across a combination of bits that looks like a special character such as '\' does no more harm than processing the data at all already does.

Upvotes: 1

Views: 789

Answers (1)

deceze

Reputation: 521994

Take PHP for example. PHP has no native understanding of encodings (there are some asterisks and footnotes here, but let's say it doesn't). It looks for certain specific bytes in source code which mean something to it, and mostly just passes through anything else that doesn't have a specific meaning for it. E.g.:

$foo = "bar $baz 42";

This triggers string interpolation; PHP will try to interpolate the variable $baz into this string. It does that by looking for the byte 0x24 (ASCII "$") and the next "non-word" byte in the string, which leads it to find the variable name $baz inside the string. Anything else in the string it just passes through as is.
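To make the byte-level view concrete, here is a rough Python model of that scan. It is only a sketch: the pattern and the `find_variables` helper are illustrative names of my own, and PHP's real interpolation rules cover more syntax (`{$x}`, `$x[0]`, `$x->y`) than this does. The point is that the scanner never decodes anything; it only looks for the byte 0x24 followed by bytes that are allowed in a variable name.

```python
import re

# Simplified model: a 0x24 byte ("$") followed by bytes that may form a name.
# Bytes 0x80-0xff are allowed in names here, mirroring PHP's variable naming
# rules; everything else about PHP's parser is deliberately left out.
VARIABLE = re.compile(rb"\$([A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*)")

def find_variables(raw: bytes) -> list[bytes]:
    """Return the names that follow a 0x24 byte, without decoding anything."""
    return VARIABLE.findall(raw)

print(find_variables(b"bar $baz 42"))  # [b'baz']
```

Everything that is not part of a match is simply left alone, which is the "passes through as is" behaviour described above.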

You can do this in PHP:

echo "意味分からない";

All PHP sees here is some binary blob which is of no particular interest to it. It does not support or understand those characters, but neither is it trying to do anything with them. It just passes the binary data through as is, and thereby happens to output the desired Japanese sentence.
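You can check why this is safe with a few lines of Python (used here only as a convenient byte inspector, assuming the file is saved as UTF-8). Every byte that UTF-8 produces for a non-ASCII code point has its high bit set, so none of them can collide with 0x24 or any other ASCII byte:

```python
sentence = "意味分からない"
utf8 = sentence.encode("utf-8")

# Show the raw bytes a byte-oriented tool would see.
print(utf8.hex(" "))

# Every byte is 0x80 or above, so no ASCII byte appears anywhere.
print(all(b >= 0x80 for b in utf8))  # True

# In particular, there is no 0x24 ("$") to trigger interpolation.
print(0x24 in utf8)                  # False
```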

Now, if we had written that sentence in some non-ASCII-safe encoding like, say, ISO-2022-JP-3, that would be:

1b24 4230 554c 234a 2c24 2b24 6924 4a24 241b 2842

You'll notice the 0x24 bytes in there. If you could produce a valid PHP file which contained these bytes between double quotes, PHP would try to interpret those 0x24 bytes as a $ and try to interpolate variables there.

$ cat /tmp/foo.php 
<?php echo "B0UL#J,$+$i$J$$";
$ xxd /tmp/foo.php 
00000000: 3c3f 7068 7020 6563 686f 2022 1b24 4230  <?php echo ".$B0
00000010: 554c 234a 2c24 2b24 6924 4a24 241b 2842  UL#J,$+$i$J$$.(B
00000020: 223b 0a                                  ";.
$ php /tmp/foo.php 
PHP Notice:  Undefined variable: B0UL in /tmp/foo.php on line 1
PHP Notice:  Undefined variable: i in /tmp/foo.php on line 1
PHP Notice:  Undefined variable: J in /tmp/foo.php on line 1
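The same thing can be reproduced outside PHP. The sketch below assumes Python's `iso2022_jp` codec as a stand-in for ISO-2022-JP-3 (for this particular sentence it should give the same bytes as the hex dump), and the regular expression is only a simplified model of PHP's variable scan, not its actual parser:

```python
import re

sentence = "意味分からない"

# Assumption: "iso2022_jp" stands in for ISO-2022-JP-3 for this sentence.
jis = sentence.encode("iso2022_jp")

print(jis.hex(" "))
print(jis.count(b"$"))  # 6 bytes with the value 0x24

# Simplified model of a byte-oriented "$name" scan (not PHP's real parser).
VARIABLE = re.compile(rb"\$([A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*)")
print(VARIABLE.findall(jis))  # [b'B0UL', b'i', b'J']
```

The three names it finds are the same three that PHP complains about in the notices.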

That's one example of a situation where UTF-8 compatibility with ASCII is important.

Upvotes: 5
