JohnDotOwl

Reputation: 3755

DOMDocument with XPath Encoding

$convertedhtml = urlencode(mb_convert_encoding($htmlcode,'UTF-8',"auto"));
$doc = new DOMDocument();
$doc->loadHTML($convertedhtml);

$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3/text()");
return $elements->item(0)->nodeValue;

The website is in GBK encoding. If I convert the markup, nothing is shown at all, but if I don't convert it, the characters come out wrong.

Any ideas? As far as I know, mb_* doesn't support GBK?

Upvotes: 0

Views: 273

Answers (1)

hakre

Reputation: 197544

The DOMDocument::loadHTML() method does not expect a UTF-8 encoded string. That makes it an exception among the methods of the DOM extension, because all the others do expect UTF-8 encoded strings. The same applies, by the way, to every method of the DOM extension that loads XML/HTML data from a file, a remote location or a string: they follow different and more complex rules for the encoding of the input.

Encoding for DOMDocument::loadHTML():

If the HTML string you pass in does not contain any hint about its encoding (e.g. inside meta tags), then the string must be encoded as Latin-1.

If the string does contain an encoding hint, then it must actually be in the hinted encoding, and that encoding must be one of the supported encodings.
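To make the two rules concrete, here is a minimal sketch rather than a definitive fix. It assumes the page really is GBK, that mbstring's CP936 encoding covers it, and that the installed libxml honours a meta http-equiv hint placed in front of the markup. The sample markup and the $hint variable are made up for illustration; only $htmlcode and the XPath expression come from the question.

```php
<?php
// Hypothetical stand-in for the fetched page: build GBK (CP936) bytes from a
// UTF-8 literal. "\u{4E2D}\u{6587}\u{6807}\u{9898}" spells a four-character
// Chinese heading without depending on this file's own encoding.
$utf8Source = "<html><head><title>t</title></head><body>"
    . "<div id='detail'><div><h3>\u{4E2D}\u{6587}\u{6807}\u{9898}</h3></div></div>"
    . "</body></html>";
$htmlcode = mb_convert_encoding($utf8Source, 'CP936', 'UTF-8');

// Step 1: convert from the known source encoding to UTF-8.
// No urlencode(), and no "auto" detection.
$utf8 = mb_convert_encoding($htmlcode, 'UTF-8', 'CP936');

// Step 2: prepend a hint so the string is in the encoding it claims to be in.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect markup warnings instead of printing them
$doc->loadHTML($hint . $utf8);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3/text()");

// nodeValue is returned as UTF-8; guard against an empty result set.
$title = $elements->length > 0 ? $elements->item(0)->nodeValue : null;
echo $title, PHP_EOL;
```

If the original page already carries its own charset declaration, the prepended hint may conflict with it, so check the markup before relying on this.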

Notes:

  • I'm not aware of whether a list of supported encodings exists.
  • As you don't show the HTML code you load, I can't say whether it contains a hint about the encoding.
  • I'm not aware of whether a list of all supported ways to hint the encoding in HTML for DOMDocument::loadHTML() exists.

However, for an example of how to load an HTML document or fragment in a specific encoding, see this related answer of mine:

It most likely will show you how you can load your HTML. It also explains this in more detail. Let me know if it doesn't solve your issue.

Upvotes: 1
