ilija veselica

Reputation: 9574

Why does php DOM parsing affect charset?

$dom = new DOMDocument();
$dom->loadHTML($string);
$dom->preserveWhiteSpace = false;
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach($elements as $span) {
    $spans[] = $span;
}
foreach($spans as $span) {
    $span->parentNode->removeChild($span);
}
return $dom->saveHTML();    
//return $string;

When I use this code to parse a string, it changes the encoding: the symbols are not displayed the same way as when I uncomment return $string and return the input untouched. Why does this happen, and how can I avoid the charset change?

Ile

Upvotes: 0

Views: 997

Answers (3)

ilija veselica

Reputation: 9574

I also noticed something strange today that I can't explain. The code above is wrapped in a function. In some cases the string it returns has <doctype...> <html><body>STRING</body></html> added around it: when data loaded from the database is passed directly to the function, the extra tags are not added, but when the data is first stored in a variable and the function is called somewhere further down, they are added.

Another strange thing: in one case I called this function to process a string and, a few lines below, added a call to trim, and the DOM function returned an error. When I deleted the trim call (which ran AFTER the DOM function), the error disappeared. Is there any reasonable explanation?

Upvotes: 0

Lukáš Lalinský

Reputation: 41306

Unfortunately, it seems that DOMDocument will automatically convert all characters to HTML entities unless it knows the encoding of the original document.

Apparently, one option is to add a <meta> tag with the content type/encoding to the original string, but this means that it will be present in the output as well. Removing it might not be so easy.
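
One way to keep the <meta> hint out of the result might be to serialize only the children of <body> instead of the whole document. This is a minimal sketch, not a tested drop-in: it assumes the input is a UTF-8 fragment, removeSpans is a hypothetical name that does not come from the question, and it needs a PHP version where saveHTML() accepts a node argument (5.3.6 or later).

```php
<?php
// Hypothetical helper: removes every <span> (with its contents) from an HTML fragment.
// Assumption: $html is UTF-8 encoded.
function removeSpans($html)
{
    $dom = new DOMDocument();

    // The <meta> tag tells the parser which encoding the bytes are in.
    // The parser moves it into <head>, so it never ends up inside <body>.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // @ silences warnings about malformed markup in the fragment.
    @$dom->loadHTML($hint . $html);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the children of <body>; this skips the doctype,
    // the <html>/<head>/<body> wrappers and the <meta> hint.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}
```

Called as removeSpans('<p>Žluťoučký <span>x</span>kůň</p>'), this should return the paragraph without the span and with the accented characters left as they were. Note that the parser may still wrap bare text in a <p>, so the result is not guaranteed to match the input byte for byte.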

Another option I can think of is manually decoding the HTML entities, using code like this:

$trans = array_flip(get_html_translation_table(HTML_ENTITIES));
unset($trans["&quot;"], $trans["&lt;"], $trans["&gt;"], $trans["&amp;"]);
echo strtr($dom->saveHTML(), $trans);

This is a seriously ugly solution, but I can't think of anything else, other than using a different HTML parser. :(

Upvotes: 1

Gumbo

Reputation: 655229

Try setting the encoding in the constructor or through the DOMDocument->encoding property:

$dom = new DOMDocument('1.0', '…');
// or
$dom = new DOMDocument();
$dom->encoding = '…';

Upvotes: 2
