Reputation: 52548
I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding(), but it was incorrect when reading from an ISO 8859-1 file on a Linux OS.
This function seems to work in all cases I tested.
Here is the question: Get file encoding
Here is the function:
function isUTF8($string){
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # Non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # Excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # Straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # Excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # Planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # Planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # Plane 16
)+%xs', $string);
}
Is this a reliable way of detecting UTF-8 strings? What exactly is it doing? Can it be made more robust?
Upvotes: 5
Views: 10024
Reputation: 33442
This may not be the answer to your question (maybe it is, see the update below), but it could be the answer to your problem. Check out my Encoding class that has methods to convert strings to UTF8, no matter if they are encoded in Latin1, Win1252, or UTF8 already, or a mix of them:
Encoding::toUTF8($text_or_array);
Encoding::toWin1252($text_or_array);
Encoding::toISO8859($text_or_array);
// fixes UTF8 strings converted to UTF8 repeatedly:
// "FÃÂédÃÂération" to "Fédération"
Encoding::fixUTF8($text_or_array);
https://stackoverflow.com/a/3479832/290221
The function runs byte by byte and figures out whether each one needs conversion or not.
Update:
Thinking a little bit more about it, this could in fact be the answer to your question:
require_once('Encoding.php');
function validUTF8($string){
return Encoding::toUTF8($string) == $string;
}
And here is the Encoding class: https://github.com/neitanod/forceutf8
Upvotes: 0
Reputation: 2533
The function in question (the one that the user pilif posted in the linked question) appears to have been taken from this comment on the mb_detect_encoding() page in the PHP Manual.
As the author states, the function is only meant to "check if a string contains UTF-8 characters" and it only looks for "non-ascii multibyte sequences in the UTF-8 range". Therefore, the function returns false (zero, actually) if your string contains only simple ASCII characters (like English text), which is probably not what you want.
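A short sketch of that behaviour, using the pattern from the question under a hypothetical name (containsUtf8Sequence is only an illustrative label, not the original author's). It suggests that the check reports 0 for plain ASCII and 1 as soon as any multibyte sequence appears, even when other bytes in the string are invalid:

```php
<?php
// The pattern from the question, wrapped under a hypothetical name.
function containsUtf8Sequence($string) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}

// Plain ASCII: no multibyte sequence anywhere, so the pattern never matches.
var_dump(containsUtf8Sequence('plain English text'));    // int(0)

// "café" encoded as UTF-8: the two bytes C3 A9 match the 2-byte branch.
var_dump(containsUtf8Sequence("caf\xC3\xA9"));           // int(1)

// The pattern is not anchored, so the invalid bytes FF FE are simply skipped.
var_dump(containsUtf8Sequence("\xFF\xFE caf\xC3\xA9"));  // int(1)
```

The third call is the reason this function cannot be used as a validity check: one well-formed sequence is enough for a match.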
His function was based on another function in this previous comment on that same page which is, in fact, meant to check if a string is UTF-8 and was based on this regular expression created by someone at W3C.
Here is the original, correctly working (I've tested) function that will tell you if a string is UTF-8:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
// From http://w3.org/International/questions/qa-forms-utf-8.html
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string) === 1; // preg_match() returns 1, 0 or false, so compare to get a real boolean
} // function is_utf8
Upvotes: 0
Reputation: 2825
Basically, no.
mb_detect_encoding is, in fact, correct by saying so. And no, you won't have any problems using ASCII text as UTF8. It's the reason UTF8 works in the first place.
As far as I understand, the function you supplied does not check the validity of the string, only that it contains some sequences that happen to look like UTF8, so it might misfire much worse. You may want to use both this function and mb_detect_encoding in strict mode and hope that they cancel out each other's false positives.
If the text is written in a non-Latin alphabet, a "smart" way to detect a multibyte encoding is to look for sequences of equally sized chunks of bytes starting with the same bits. For example, the Russian word "привет" looks like this in UTF-8:
11010000 10111111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010
This, however, won't work for Latin-based alphabets (and, probably, Chinese).
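One way this heuristic could be sketched, purely as an illustration: count how often each lead byte occurs among the 2-byte pairs in the data. The helper name leadByteHistogram is hypothetical, and deciding how dominant a lead byte must be before the text "looks like" UTF-8 is left open here.

```php
<?php
// Hypothetical helper: counts the lead bytes of all 2-byte, UTF-8-shaped pairs.
function leadByteHistogram($bytes) {
    // No /u flag: the pattern is applied to raw bytes, not decoded characters.
    preg_match_all('/([\xC2-\xDF])[\x80-\xBF]/', $bytes, $matches);
    $counts = array_count_values(array_map('bin2hex', $matches[1]));
    arsort($counts); // most frequent lead byte first
    return $counts;
}

// "привет" as UTF-8 bytes, matching the bit patterns listed above.
$privet = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

print_r(leadByteHistogram($privet));  // d0 => 4, d1 => 2
print_r(leadByteHistogram('hello'));  // empty array: no 2-byte pairs at all
```

Cyrillic text concentrates on the lead bytes D0 and D1, which is what makes the pattern visible. Mostly-ASCII text yields few or no pairs, which matches the caveat about Latin-based alphabets.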
Upvotes: 0
Reputation: 522250
If you do not know the encoding of a string, it is impossible to guess the encoding with any degree of accuracy. That's why mb_detect_encoding simply does not work. If however you know what encoding a string should be in, you can check if it is a valid string in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it'll be impossible to distinguish any single-byte encoding using all 8 bits from any other single-byte encoding using 8 bits. But UTF-8 should be rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
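A minimal sketch of that check, assuming the mbstring extension is loaded. The sample strings are made up for illustration; the last one shows a byte sequence that is valid in both encodings, so passing the check does not prove how the text was originally written.

```php
<?php
// Requires the mbstring extension.
$utf8   = "caf\xC3\xA9";  // "café" encoded as UTF-8
$latin1 = "caf\xE9";      // "café" encoded as ISO 8859-1 (Latin-1)

var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)
// 0xE9 would start a 3-byte sequence, but no continuation bytes follow.
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)

// These two bytes are "é" in UTF-8, but also the legal Latin-1 string "Ã©".
$ambiguous = "\xC3\xA9";
var_dump(mb_check_encoding($ambiguous, 'UTF-8'));       // bool(true)
var_dump(mb_check_encoding($ambiguous, 'ISO-8859-1'));  // bool(true)
```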
Upvotes: 7
Reputation: 97835
That will just detect whether part of the string is a formally valid UTF-8 sequence, ignoring characters encoded as a single code unit (those representing code points in ASCII). For that function to return true, it suffices that there's one character that looks like a non-ASCII UTF-8 encoded character.
Upvotes: 0
Reputation: 11986
Well, it only checks whether the string has byte sequences that happen to correspond to validly encoded multibyte UTF-8 code points. However, it won't flag bytes in the range 0x00-0x7F, which is the ASCII-compatible subset of UTF-8.
EDIT: Incidentally, I am guessing the reason you thought mb_detect_encoding() "didn't work properly" was that your Latin-1 encoded file only used the ASCII-compatible subset, which is also valid UTF-8. It's no wonder that mb_detect_encoding() would flag that as UTF-8, and it is "correct": if the data is just plain ASCII, then the answer UTF-8 is as good as Latin-1, or ASCII, or any of the myriad extended ASCII encodings.
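To illustrate the point, here is a small sketch (assuming the mbstring extension; the sample string is arbitrary) showing that plain ASCII data passes the validity check for all three encodings and keeps the same bytes when converted from Latin-1 to UTF-8:

```php
<?php
// Requires the mbstring extension.
$ascii = 'Hello, world!';  // only bytes in the range 0x00-0x7F

foreach (['ASCII', 'UTF-8', 'ISO-8859-1'] as $encoding) {
    // Prints bool(true) three times: the bytes are valid in every candidate.
    var_dump(mb_check_encoding($ascii, $encoding));
}

// Converting from Latin-1 to UTF-8 changes nothing for ASCII-only input.
var_dump(mb_convert_encoding($ascii, 'UTF-8', 'ISO-8859-1') === $ascii);  // bool(true)
```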
Upvotes: 0