Mohebifar

Reputation: 3411

Wrong character encoding with DOMDocument in PHP

I have some HTML content whose text is entirely Persian. I want to pass it to DOMDocument with DOMDocument::loadHTML($html), do some processing, and then get it back with DOMDocument::saveHTML(), but the characters come out wrong: for example, "سلام" comes back as mojibake such as "Ø³Ù„Ø§Ù…". I even changed my script file's encoding to UTF-8, but that didn't help.

<?php
$html = "<html><meta charset='utf-8' /> سلام</html>";

$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($html);
print $html; // output: سلام
print $doc->saveHTML(); // output: garbled (mojibake), not سلام
print $doc->saveHTML($doc->documentElement); // output: garbled (mojibake), not سلام
?>

UPDATE: Following friends' advice, I used $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')); and it worked!

Upvotes: 4

Views: 2708

Answers (2)

Peyman Mohamadpour

Reputation: 17964

$html = '<html>سلام</html>';
$doc = new DOMDocument();

Convert the UTF-8 string $html to HTML entities, then load it into the DOM, passing two predefined libxml constants (LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD).

The first sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic insertion of implied html/body... elements (only available as of PHP 5.4.0).

The second sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when none is found. Using these constants lets you control parsing more flexibly.

$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Then set the encoding of the DOM document itself (the previous step only covered the input):

$doc->encoding = 'UTF-8';

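If the flags above are not available to you (they need libxml 2.7.7 or later, as of PHP 5.4.0), the implied <html> and <body> wrappers will still be in the tree. Normalize the document and print it:

$doc->normalizeDocument(); // puts the tree in "normal" form; note that this does not strip <html> or <body> by itself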
print $doc->saveHTML($doc->documentElement);
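
One note on the conversion step: the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. Below is a minimal sketch of the same idea using mb_encode_numericentity() instead, assuming the input is valid UTF-8. The helper name loadUtf8Html and the conversion map are illustrative choices, not part of the steps above.

```php
<?php
// Hypothetical helper (illustrative name): load a UTF-8 HTML string by first
// turning every non-ASCII character into a numeric character reference.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    // This range covers U+0080 through U+10FFFF, so ASCII markup is untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $doc->encoding = 'UTF-8'; // encoding used when serializing
    return $doc;
}

$doc = loadUtf8Html('<html>سلام</html>');
print $doc->saveHTML($doc->documentElement);
```

Because the parser only ever sees ASCII plus numeric references, this does not depend on libxml guessing the input encoding.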

Have fun!

Upvotes: 3

Alf Eaton

Reputation: 5483

Tell the XML parser that the data being read is UTF-8 encoded:

<?php

// original input (unknown encoding)
$html = '<html>سلام</html>';

$doc = new DOMDocument();

// specify the input encoding
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);

// specify the output encoding
$doc->encoding = 'utf-8';

// output: <html><body><p>سلام</p></body></html>
print $doc->saveHTML($doc->documentElement);
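
If you need this in more than one place, the same steps can be wrapped in a small function. This is only a sketch: the name loadHtmlAsUtf8 is made up for illustration, and it assumes the prefix ends up in the tree as a processing-instruction node (as it does on the libxml versions I have seen), which would otherwise show up if you call saveHTML() without a node argument.

```php
<?php
// Hypothetical helper (illustrative name): parse UTF-8 HTML using the
// XML-declaration prefix, then drop the prefix node from the tree.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);

    // Copy the node list first: removing nodes while iterating a live
    // DOMNodeList can skip entries.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }

    $doc->encoding = 'utf-8'; // output encoding
    return $doc;
}

$doc = loadHtmlAsUtf8('<html>سلام</html>');
print $doc->saveHTML($doc->documentElement);
```

If no processing-instruction node is present, the loop simply does nothing.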

Upvotes: 4
