Reputation: 1033
I just ran into a problem: an HTML document encoded in UTF-8 comes out garbled after loadHTML().
There are plenty of Q&As about this on Stack Overflow.
However, as I understand it, most of the answers assume that the source encoding is UTF-8, so they recommend using the mb_convert_encoding() function like this:
$dom->loadHTML(mb_convert_encoding($document_in_utf_8, 'HTML-ENTITIES', 'UTF-8'));
I suppose this works only if the source document is in UTF-8. Unfortunately, in my case the source document can be in any encoding: Windows-1251, UTF-8, KOI8-R, and so on.
So what is the best practice to handle this problem for any encoding?
UPDATE 1: I just found the mb_detect_encoding() function. Is it good practice to use it to detect the encoding? Something like this:
$encoding = mb_detect_encoding($doc);
$doc = mb_convert_encoding($doc, 'HTML-ENTITIES', $encoding);
$dom->loadHTML($doc);
I tested this on several documents and it seems to work, but can I be sure that it will work in all reasonable cases?
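For illustration, here is a minimal sketch of the detect-then-convert idea with a few guards added. It assumes the set of possible encodings is known in advance (the default candidate list is just the three encodings mentioned above), passes that list to mb_detect_encoding() in strict mode, and uses mb_encode_numericentity() instead of the 'HTML-ENTITIES' target, which is deprecated as of PHP 8.2. The function name load_html_any_encoding() and the UTF-8 fallback are hypothetical choices, not an established API. Telling single-byte encodings such as Windows-1251 and KOI8-R apart is heuristic, so this is not guaranteed to be right for every document.

```php
<?php
// Hypothetical helper (not part of PHP or of any answer): detect the encoding
// among an explicit candidate list, then hand DOMDocument pure-ASCII markup.
function load_html_any_encoding(
    string $html,
    array $candidates = ['UTF-8', 'Windows-1251', 'KOI8-R']
): DOMDocument {
    // Strict mode (third argument) rejects candidates for which the bytes are invalid.
    $encoding = mb_detect_encoding($html, $candidates, true);
    if ($encoding === false) {
        // Assumption: fall back to UTF-8 when nothing in the list matches.
        $encoding = 'UTF-8';
    }

    // Normalize to UTF-8 first.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // Turn every non-ASCII character into a numeric entity (&#NNNN;), so the
    // parser no longer has to guess the byte encoding.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($utf8, $convmap, 'UTF-8');

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}
```

If the document declares its encoding itself (an HTTP Content-Type header or a meta charset tag), that declaration is probably a more reliable source than byte-level detection, and the detected value could serve only as a fallback.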
Upvotes: 2
Views: 460