Vlada Katlinskaya


How to handle HTML with PHP's DOMDocument if the encoding of the source HTML is unknown?

I just ran into the problem that a UTF-8 encoded HTML document comes out garbled after loadHTML().

There are plenty of similar Q&As on Stack Overflow.

However, as I understand it, most of the answers assume that the source encoding is UTF-8. They recommend using the mb_convert_encoding() function like this:

$dom->loadHTML(mb_convert_encoding($document_in_utf_8, 'HTML-ENTITIES', 'UTF-8'));

I suppose this works only if the source document is in UTF-8. Unfortunately, in my case the source document can be in any encoding: Windows-1251, UTF-8, KOI8-R and so on.

So what is the best practice to handle this problem for any encoding?

UPDATE 1: I just found the mb_detect_encoding() function. Is it good practice to use it to detect the encoding? Something like this:

$encoding = mb_detect_encoding($doc);
$doc = mb_convert_encoding($doc, 'HTML-ENTITIES', $encoding);
$dom->loadHTML($doc);

I tested this on several documents and it seems to work, but can I be sure that it will work for all reasonable cases?
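
For reference, a stricter variant of the same idea might look like the sketch below. It assumes an explicit list of candidate encodings and strict detection. The helper names html_to_utf8() and load_html_any_encoding() are hypothetical, not part of any library. Instead of the 'HTML-ENTITIES' pseudo-encoding (deprecated as of PHP 8.2), it converts to UTF-8 and then uses mb_encode_numericentity(). One caveat: single-byte encodings such as Windows-1251 and KOI8-R accept almost any byte sequence, so choosing between them is only a guess.

```php
<?php
// Hypothetical helper: convert HTML of unknown encoding to UTF-8.
// $candidates is an assumed list of encodings the input may use.
function html_to_utf8(string $html, array $candidates): ?string
{
    // Strict mode rejects candidates for which the bytes are invalid.
    $encoding = mb_detect_encoding($html, $candidates, true);
    if ($encoding === false) {
        return null; // none of the candidates fits the input
    }
    return mb_convert_encoding($html, 'UTF-8', $encoding);
}

// Hypothetical helper: load HTML of unknown encoding into a DOMDocument.
function load_html_any_encoding(string $html, array $candidates): ?DOMDocument
{
    $utf8 = html_to_utf8($html, $candidates);
    if ($utf8 === null) {
        return null;
    }
    // Turn every non-ASCII character into a numeric entity, so the
    // parser's own encoding guess no longer matters.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    // @ suppresses libxml warnings about imperfect real-world markup.
    return @$dom->loadHTML($ascii) ? $dom : null;
}
```

Usage would be something like $dom = load_html_any_encoding($doc, ['UTF-8', 'Windows-1251', 'KOI8-R']); a null result means that none of the candidates matched or that parsing failed.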

