DOMDocument breaks encoding?

Question

I run the following code:

$page = 'Ä';
$DOM = new DOMDocument;
$DOM->loadHTML($page);
echo 'source: '.$page;
echo 'dom: '.$DOM->getElementsByTagName('p')->item(0)->textContent;

and it outputs the following:

source: Ä

dom: Ã

Why does the text's encoding get broken when it passes through DOMDocument?

Niet the Dark Absol · Accepted Answer

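DOMDocument appears to be treating the input as ISO-8859-1 and returning the text as UTF-8. Your source is UTF-8, so Ä is stored as two bytes (0xC3 0x84), and each byte is read as a separate character. In this conversion, Ä becomes Ã„. Here's the catch: that second character (byte 0x84) has no printable form in ISO-8859-1, but does exist in Windows-1252. This is why you are seeing no second character in your output.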

You can fix this by calling utf8_decode on the output of textContent, or using UTF-8 as your page's character encoding.
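
A minimal sketch of the second suggestion, assuming the goal is to tell DOMDocument that the markup is UTF-8 before it parses anything. The helper name `loadUtf8Html` is hypothetical and not part of the original post; it only prepends a charset declaration to the fragment. Note that utf8_decode is deprecated as of PHP 8.2, which is another reason to prefer declaring the encoding up front.

```php
<?php
// "\xC3\x84" is 'Ä' encoded as UTF-8, written as escapes so the example
// does not depend on how this source file itself is saved.
$page = "<p>\xC3\x84</p>";

// Hypothetical helper (an assumption for illustration): prepend a charset
// declaration so the HTML parser reads the following bytes as UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    return $dom;
}

$dom  = loadUtf8Html($page);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo 'source: ' . $page . "\n";
echo 'dom: ' . $text . "\n";
```

This is only meant for fragments like the one in the question. If the markup already carries its own charset declaration, the prepended tag may conflict with it.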
