tomsseisums

Reputation: 13367

Preserve utf8 when loading HTML from file

Well, apparently, PHP and its standard libraries have some problems, and DOMDocument isn't an exception.

There are workarounds for utf8 characters when loading an HTML string with $dom->loadHTML().

However, I haven't found a way to do this when loading HTML from a file with $dom->loadHTMLFile(). While it reads and sets the encoding from <meta /> tags, the problem strikes back if I haven't defined those. For instance, when loading a fragment of HTML (a template part such as footer.html) rather than a fully built HTML document.

So, how do I preserve utf8 characters when loading HTML from a file that doesn't have its <meta /> keys present, and where defining those is not an option?

Update

footer.html (the file is encoded in UTF-8 without BOM):

<div id="footer">
    <p>My sūpēr ōzōm ūtf8 štrīņģ</p>
</div>

index.php:

$dom = new DOMDocument;
$dom->loadHTMLFile('footer.html');
echo $dom->saveHTML(); // results in all familiar effed' up characters

Thanks in advance!

Upvotes: 4

Views: 4589

Answers (4)

Luke

Reputation: 1329

I would suggest using my answer here: https://stackoverflow.com/a/12846243/816753 but, instead of adding another <head>, wrap your entire fragment in:

<html>
    <head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>
    <body><!-- your content here --></body>
</html>
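
The wrapping suggested above could be done at load time, so that footer.html itself stays untouched. This is only a minimal sketch: the helper name loadUtf8Fragment() is hypothetical, and a temporary file stands in for footer.html so that the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of DOMDocument): reads a fragment from disk and
// wraps it in a minimal document whose <meta> declares UTF-8 before parsing.
function loadUtf8Fragment(string $path): DOMDocument
{
    $fragment = file_get_contents($path);
    if ($fragment === false) {
        throw new RuntimeException("Cannot read $path");
    }

    $wrapped = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . '</head><body>' . $fragment . '</body></html>';

    $dom = new DOMDocument();
    $dom->loadHTML($wrapped);

    return $dom;
}

// Demo: a temporary file stands in for footer.html (UTF-8, no BOM).
$path = tempnam(sys_get_temp_dir(), 'footer');
file_put_contents($path, "<div id=\"footer\">\n    <p>My sūpēr ōzōm ūtf8 štrīņģ</p>\n</div>\n");

$dom = loadUtf8Fragment($path);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
unlink($path);

echo $text, PHP_EOL;
```

Because the fragment ends up inside <body>, you would serialise only the children of <body> if you need the fragment back without the wrapper.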

Upvotes: 5

Sinthia V

Reputation: 2093

Try a hack like this one:

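// Read the fragment yourself so the encoding hint can be prepended
$html = file_get_contents('footer.html');

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix: copy the node list first, because removing nodes while
// iterating the live childNodes list can skip entries
foreach (iterator_to_array($doc->childNodes) as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper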

Several others are listed in the user comments here: http://php.net/manual/en/domdocument.loadhtml.php. It is also important that your document head include a meta tag specifying the encoding FIRST, directly after the opening <head> tag.

Upvotes: 6

simshaun

Reputation: 21476

While I'm not sure how to go about solving the problem with ->loadHTMLFile(), have you considered using file_get_contents() to get the HTML, running mb_convert_encoding() on that string, and then passing the result to ->loadHTML()?

Edit: Also, when you initialize DOMDocument, are you giving it the $encoding argument?
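
The file_get_contents() route above might look roughly like the sketch below. Two things here are my assumptions rather than part of the suggestion: the helper name loadHtmlFileAsEntities() is hypothetical, and I have swapped in mb_encode_numericentity() because the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. It also assumes the mbstring extension is loaded and that the file really is UTF-8.

```php
<?php
// Hypothetical helper: reads the file, rewrites every non-ASCII character as a
// numeric entity, then parses the now pure-ASCII markup with loadHTML().
function loadHtmlFileAsEntities(string $path): DOMDocument
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Map: code points 0x80..0x10FFFF, no offset, mask wide enough to keep all bits.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii);

    return $dom;
}

// Demo: a temporary file stands in for footer.html (UTF-8, no BOM).
$path = tempnam(sys_get_temp_dir(), 'footer');
file_put_contents($path, "<div id=\"footer\">\n    <p>My sūpēr ōzōm ūtf8 štrīņģ</p>\n</div>\n");

$dom = loadHtmlFileAsEntities($path);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
unlink($path);

echo $text, PHP_EOL;
```

Since the parser only ever sees ASCII, it no longer has to guess the encoding, which is why no <meta /> tag is needed with this approach.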

Upvotes: 4

jValdron

Reputation: 3418

The <meta /> key is for your browser only. Once the page is all built up, your browser should display it correctly as long as the final page has the meta tag.

You can always try using the utf8_decode (or encode, I'm never sure lol) function before echoing the data, like so:

echo utf8_decode($dom->saveHTML());

Upvotes: 3
