vaso123

Reputation: 12391

PHP function mb_detect_encoding strict mode

In the function mb_detect_encoding there is a parameter for strict mode.

In the first, most upvoted comment:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

This is indeed what happens, but can anybody explain why?

Upvotes: 7

Views: 4529

Answers (3)

user3942918

Reputation: 26413

Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.


It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
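To make the "valid as a whole" idea concrete, here is a minimal sketch that uses mb_check_encoding() as a stand-in for whole-string validation. It is not the code path mb_detect_encoding() takes internally, and the variable names are mine; it only shows the difference between a complete and a truncated multi-byte sequence (mbstring extension assumed to be loaded).

```php
<?php
// Whole-string validation with mb_check_encoding(). This approximates what
// strict mode appears to aim for; it is NOT what mb_detect_encoding() runs.

$complete  = "foo\xc3\xa1"; // "fooá": 0xC3 0xA1 is a complete two-byte sequence
$truncated = "foo\xc3";     // ends with a lead byte whose continuation is missing

$completeIsValid  = mb_check_encoding($complete, 'UTF-8');
$truncatedIsValid = mb_check_encoding($truncated, 'UTF-8');

var_dump($completeIsValid);  // bool(true)
var_dump($truncatedIsValid); // bool(false)
```

The truncated string is the kind of input that, per my reading, non-strict mode was meant to tolerate.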

However, there seems to be a bug* where, in some circumstances, non-strict mode checks only the first byte of the string.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string mb_detect_encoding() properly returns false for it regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that can occur in UTF-8, non-strict mode returns UTF-8, even when an invalid byte follows later.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.


*I've opened a report for this at https://bugs.php.net/bug.php?id=72933

Upvotes: 4

Álvaro González

Reputation: 146640

áéóú in ISO-8859-1 encodes as:

e1 e9 f3 fa

If you mis-interpret it as UTF-8 you only get four invalid byte sequences. The Multi-Byte extension is basically designed to ignore errors. For instance, mb_convert_encoding() will replace those sequences with question marks or whatever you set with mb_substitute_character().
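A quick way to see this for yourself is sketched below. It writes the bytes out explicitly so the result does not depend on how the source file is saved, and it assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// 'áéóú' as ISO-8859-1 bytes, spelled out so the file encoding does not matter.
$latin1 = "\xe1\xe9\xf3\xfa";

$hex         = bin2hex($latin1);                      // raw bytes as hex
$validAsUtf8 = mb_check_encoding($latin1, 'UTF-8');   // whole-string validity

// Converting with the correct source encoding yields real UTF-8.
$utf8    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$utf8Hex = bin2hex($utf8);

var_dump($hex);         // string(8) "e1e9f3fa"
var_dump($validAsUtf8); // bool(false)
var_dump($utf8Hex);     // string(16) "c3a1c3a9c3b3c3ba"
```

Each character becomes a two-byte sequence once it is properly converted, which is why the original four bytes cannot pass as UTF-8.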

My educated guess is that the strict parameter determines what should be done with invalid byte sequences:

  • false means to remove them
  • true means to keep them

If you ignore these invalid sequences you're obviously discarding extremely valuable information and you only get sensible results in very limited circumstances, e.g.

$str = chr(81);
var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To sum up, mb_detect_encoding() is in general not as useful as you may think, and it's total crap with the default parameters.

Upvotes: 2

Justinas

Reputation: 43557

Because $str is not actually UTF-8 but ISO-8859-1. In non-strict mode an ISO-8859-1 string may still be accepted as UTF-8, but in strict mode only actual UTF-8 matches UTF-8 (explained here).

Upvotes: -2
