How to handle HTML with PHP's DOMDocument if the encoding of source HTML is unknown?

Question

I just found that an HTML document encoded in UTF-8 comes out garbled after loadHTML().

There are plenty of similar Q&As on Stack Overflow.

However, as I understand it, most of the answers assume that the source encoding is UTF-8, so they recommend using the mb_convert_encoding() function like this:

$dom->loadHTML(mb_convert_encoding($document_in_utf_8, 'HTML-ENTITIES', 'UTF-8'));

I suppose this works only if the source document is in UTF-8. Unfortunately, in my case the source document can be in any encoding: Windows-1251, UTF-8, KOI8-R, and so on.

So what is the best practice to handle this problem for any encoding?

UPDATE 1: I just found the mb_detect_encoding() function. Is it good practice to use it to detect the encoding, like this:

$encoding = mb_detect_encoding($doc);
$doc = mb_convert_encoding($doc, 'HTML-ENTITIES', $encoding);
$dom->loadHTML($doc);

I tested this on several documents and it seems to work, but can I be sure that it will work for all reasonable cases?
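For reference, here is a slightly more defensive variant of the snippet above. It is only a sketch under some assumptions. The helper name load_html_unknown_encoding() and the default candidate list are hypothetical and not part of PHP or DOMDocument. It passes an explicit candidate list and strict mode to mb_detect_encoding(), falls back to UTF-8 when nothing matches, and uses mb_encode_numericentity() in place of the 'HTML-ENTITIES' target, which is deprecated as of PHP 8.2. Note that single-byte encodings such as Windows-1251 and KOI8-R are both valid for almost any byte sequence, so choosing between them remains a guess.

```php
<?php
// Hypothetical helper (not a PHP built-in): load HTML whose encoding is unknown.
// $candidates is an assumed list; put the encodings you actually expect,
// ordered from most to least likely.
function load_html_unknown_encoding(
    string $html,
    array $candidates = ['UTF-8', 'Windows-1251', 'KOI8-R']
): DOMDocument {
    // Strict mode accepts only encodings in which the whole string is valid.
    $encoding = mb_detect_encoding($html, $candidates, true);
    if ($encoding === false) {
        $encoding = 'UTF-8'; // assumption: fall back to UTF-8 when detection fails
    }

    // Normalize to UTF-8 first.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // Replace every non-ASCII character with a numeric entity so that the
    // string passed to loadHTML() is plain ASCII.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}
```

It could be called as $dom = load_html_unknown_encoding($doc); with $doc holding the raw bytes of the page. Because the text is turned into entities before parsing, any non-ASCII characters inside script or style elements would also be rewritten, so this sketch suits documents where only the markup and text content matter.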
