Reputation: 27190
This is my code:
$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();
This is the output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&Atilde;&uml;&Atilde;&nbsp;&Atilde;&copy;&Atilde;&not;&Atilde;&sup2;&Atilde;&sup1;</p></body></html>
I want this output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>
I've tried:
$oDom = new DomDocument('4.0', 'UTF-8');
and the same with 1.0 and other variations, but nothing changed.
Another thing:
Is there a way to get the HTML back untouched?
For example, given the input <p>hello!</p>
I'd like to get exactly <p>hello!</p> as output,
using DOMDocument only to parse the DOM and make some substitutions inside the tags.
Upvotes: 36
Views: 24574
Reputation: 2023
This worked for me:
<?php
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
?>
Credits: https://www.php.net/manual/en/domdocument.loadhtml.php#95251
Upvotes: 0
Reputation: 1888
What worked for me was:
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
credit: https://davidwalsh.name/domdocument-utf8-problem
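Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. Below is a minimal sketch of the same idea using mb_encode_numericentity() instead, assuming the input is valid UTF-8 and the mbstring extension is loaded. loadUtf8Html is a hypothetical helper name used only for illustration, not something from the linked article.

```php
<?php
// Hypothetical helper: encode every non-ASCII character as a numeric entity
// so that loadHTML() only ever sees plain ASCII bytes.
function loadUtf8Html(string $content): DOMDocument
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($content, $map, 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = loadUtf8Html('<p>èàéìòù</p>');
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The parser decodes the numeric entities itself, so the text nodes in the DOM hold the original characters regardless of which encoding libxml would have guessed for the raw bytes.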
Upvotes: 3
Reputation: 480
None of the above worked for me but this one did the job:
$fileContent = file_get_contents('my_file.html');
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($fileContent, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->encoding = 'utf-8';
$html = $dom->saveHTML();
$html = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
echo $html;
Upvotes: 1
Reputation: 121
$dom = new DomDocument();
$str = htmlentities($str);
$dom->loadHTML(utf8_decode($str));
$dom->encoding = 'utf-8';
.
.
.
$str = $dom->saveHTML();
$str = html_entity_decode($str);
The above code worked for me.
Upvotes: 7
Reputation: 181
I don't know why the marked answer didn't work for my problem. But this one did.
ref: https://www.php.net/manual/en/class.domdocument.php
<?php
// check that the content we're receiving isn't empty, to avoid the warning
if ( empty( $content ) ) {
    return false;
}
// convert all non-ASCII characters to HTML entities
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');
// create a new document
$doc = new DOMDocument('1.0', 'utf-8');
// collect libxml errors internally instead of emitting warnings
libxml_use_internal_errors(true);
// load the content without adding the enclosing html/body tags or the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// do whatever you want to do with this code now
?>
Upvotes: 6
Reputation: 1742
This way:
/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);
    return $dom;
}
Upvotes: 4
Reputation: 27190
Solution:
$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!
$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!
The saveHTML() method works differently when you pass it a node.
You can pass the root node ($oDom->documentElement) and add the desired !DOCTYPE manually.
Another important thing is utf8_decode().
None of the attributes or other methods of the DOMDocument class produced the desired result in my case.
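The same node-based saveHTML() call can also cover the "untouched HTML" part of the question. What follows is only a sketch under assumptions: fragmentToHtml is a hypothetical helper name, the input is assumed to be UTF-8, and the XML-declaration prefix suggested in other answers is used in place of utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not from the answer above): returns the markup of the
// parsed fragment without the doctype/html/body wrappers that loadHTML() adds.
function fragmentToHtml(string $fragment): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // The XML-declaration prefix makes libxml read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Passing a node makes saveHTML() serialize only that subtree.
        $html .= $dom->saveHTML($child);
    }

    return $html;
}

echo fragmentToHtml('<p>hello!</p>'), "\n";
```

Because only the children of body are serialized, the injected prefix never reaches the output and does not need to be removed from the document first. The parser still normalizes the markup, so invalid or unclosed tags in the input will not come back byte for byte.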
Upvotes: 66
Reputation:
The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting
<meta http-equiv="content-type" content="text/html; charset=utf-8">
in the document before you put any strings with non-ASCII chars in.
Another hack suggests putting
<?xml encoding="UTF-8">
as the first text in the document and then removing it at the end.
Nasty stuff. Smells like a bug to me.
Upvotes: 5
Reputation: 189
Try to set the encoding type after you have loaded the HTML.
$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();
Upvotes: 7
Reputation: 944548
Looks like you just need to set substituteEntities when you create the DOMDocument object.
Upvotes: 0