Reputation: 337
I recently switched from PHP 7 to PHP 8.2.7 and noticed that mb_detect_encoding seems to no longer work reliably. I am doing the following:
mb_detect_encoding(file_get_contents('somefile.csv'), 'UTF-8, ISO-8859-1', true);
For one particular file, the value returned is ISO-8859-1, even though it is clearly a UTF-8 encoded file. It even has a UTF-8 BOM. I understand that the detection algorithm changed in PHP 8, but how can a clearly UTF-8 encoded file be mistaken for ISO-8859-1? I get that most UTF-8 encoded strings are also valid ISO-8859-1 strings, but what good is mb_detect_encoding if it fails to see the difference? By the way, the file in question is 1759 bytes long and has around 30 two-byte UTF-8 characters, so in my opinion this should be plenty to detect it as UTF-8.
I cannot upload files, but this is the first line of the .CSV file:
Buchungstag;Wertstellung;Umsatzart;Buchungstext;Betrag;Währung;Auftraggeberkonto;Bankleitzahl Auftraggeberkonto;IBAN Auftraggeberkonto
Upvotes: 1
Views: 576
Reputation: 337
I still feel that mb_detect_encoding is broken, but at least I found that it is the byte order mark (EF BB BF) at the beginning of the file that appears to throw it off. If the BOM is present and the rest of the file is short (less than about 4 KB), the file appears to be detected as ISO-8859-1.
One workaround is to strip off a potential BOM first:
$encoding = mb_detect_encoding(preg_replace("/^\xef\xbb\xbf/", '', file_get_contents('somefile.csv')), 'UTF-8, ISO-8859-1', true);
or to use mb_check_encoding:
$encoding = mb_check_encoding(file_get_contents('somefile.csv'), 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
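The two workarounds can also be combined in a small helper. This is only a minimal sketch, assuming UTF-8 and ISO-8859-1 are the only candidates, as in the question. The function name detectCsvEncoding is hypothetical, and the ISO-8859-1 fallback is an assumption rather than a real detection.

```php
<?php
// Hypothetical helper (not part of mbstring): decide between UTF-8 and ISO-8859-1.
function detectCsvEncoding(string $bytes): string
{
    // Strip a leading UTF-8 BOM (EF BB BF) so it cannot influence the check.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }

    // mb_check_encoding() validates the bytes instead of guessing:
    // it returns true only if the whole string is well-formed UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    // Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
    return 'ISO-8859-1';
}
```

It would be called as $encoding = detectCsvEncoding(file_get_contents('somefile.csv')); and, if the result is ISO-8859-1, the content could then be converted with mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1').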
Upvotes: 0