Reputation: 2153
I run the following code:
$page = '<p>Ä</p>';
$DOM = new DOMDocument;
$DOM->loadHTML($page);
echo 'source: '.$page;
echo 'dom: '.$DOM->getElementsByTagName('p')->item(0)->textContent;
and it outputs the following:
source: Ä
dom: Ã
Why does the text's encoding break when it passes through DOMDocument?
Upvotes: 5
Views: 1516
Reputation: 173662
Here's a workaround that adds the proper encoding via meta header:
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);
I'm not sure whether UTF-8 is the character set you're actually using, so adjust as necessary.
See also: domdocument character set issue
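For reference, here is a minimal, self-contained sketch of that workaround. It assumes the input really is UTF-8, so the Ä is written as explicit bytes to keep the example independent of how the source file is saved. The hex dump is only there to make the result checkable, and the exact behavior may vary with the libxml2 version PHP is built against.

```php
<?php
// "Ä" (U+00C4) as explicit UTF-8 bytes, so the file's own encoding does not matter.
$page = "<p>\xC3\x84</p>";

$dom = new DOMDocument();

// Prepend a charset declaration so the HTML parser decodes the input as UTF-8.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">' . $page
);

$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo 'dom: ' . $text . PHP_EOL;
echo 'hex: ' . bin2hex($text) . PHP_EOL; // c384 if the text survived intact
```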
Upvotes: 8
Reputation: 324820
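DOMDocument appears to be treating the input as ISO-8859-1 rather than UTF-8, because the markup declares no character set. Each byte of the UTF-8 sequence for Ä (0xC3 0x84) is read as a separate character, and textContent returns those characters re-encoded as UTF-8. In this conversion, Ä becomes Ã„. Here's the catch: that second character (byte 0x84) does not exist as a printable character in ISO-8859-1, but does exist in Windows-1252. This is why you are seeing no second character in your output.
You can fix this by calling utf8_decode on the output of textContent (deprecated as of PHP 8.2; mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8') is the equivalent), or by using UTF-8 as your page's character encoding.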
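The byte-level round trip can be sketched without DOMDocument at all. This is only an illustration, assuming the mbstring extension is available: it uses mb_convert_encoding to simulate the misreading described above and then to reverse it, with variable names made up for the example.

```php
<?php
// UTF-8 bytes for "Ä" (U+00C4).
$utf8 = "\xC3\x84";

// Simulate the misreading: treat each byte as one ISO-8859-1 character
// and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled) . PHP_EOL;  // c383c284, i.e. U+00C3 followed by U+0084

// Reverse it: map the UTF-8 text back to single ISO-8859-1 bytes,
// which restores the original UTF-8 sequence.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored) . PHP_EOL; // c384
```

Decoding the output this way only works when every character fits in ISO-8859-1; declaring the charset before parsing avoids that limitation.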
Upvotes: 6