Reputation: 3755
$convertedhtml = urlencode(mb_convert_encoding($htmlcode,'UTF-8',"auto"));
$doc = new DOMDocument();
$doc->loadHTML($convertedhtml);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@id='detail']/div[1]/h3/text()");
$elements->item(0)->nodeValue;
return ($elements->item(0)->nodeValue);
The website is in GBK encoding. If I convert it, nothing is shown at all, but if I don't convert it, the characters come out wrong.
Any ideas? From what I know, mb_* doesn't support GBK?
Upvotes: 0
Views: 273
Reputation: 197544
The DOMDocument::loadHTML() method does not expect a UTF-8 encoded string. In that respect it is an exception among the methods of the DOM extension, because all the others expect UTF-8 encoded strings. The same, by the way, applies to every method of the DOM extension that loads XML/HTML data from a file, a remote location or a string: they follow different and more complex rules for the encoding of the input.
Encoding for DOMDocument::loadHTML():
If the HTML string you pass in there does not contain any hinting on the encoding (e.g. inside meta-tags), then the encoding of the string must be Latin-1.
If the string does have a hint of the encoding, then it needs to be in that hinted encoding and that one needs to be one of the supported encodings.
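One way to sidestep both rules is to hand loadHTML() a pure-ASCII string, so that neither the Latin-1 default nor any encoding hint in the markup can misread it. The following is only a minimal sketch under a few assumptions: the mbstring extension is available, 'CP936' is used as mbstring's name for the GBK code page, and the function name extractHeading is made up for illustration. It is a generic workaround, not necessarily the approach from the related answer mentioned below.

```php
<?php
// Hypothetical helper: takes GBK-encoded HTML, returns the heading text as UTF-8.
function extractHeading(string $gbkHtml): ?string
{
    // Assumption: the page really is GBK; 'CP936' is mbstring's label for that code page.
    $utf8 = mb_convert_encoding($gbkHtml, 'UTF-8', 'CP936');

    // Turn every non-ASCII character into a numeric entity (e.g. &#20013;),
    // leaving a plain ASCII string that is valid in any hinted encoding.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same XPath expression as in the question.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//*[@id='detail']/div[1]/h3/text()");

    // Values read back from the DOM are UTF-8 strings.
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}
```

The value returned is UTF-8, so it can be echoed directly on a UTF-8 page. If the source page turns out not to be GBK, only the third argument of mb_convert_encoding() would need to change.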
Notes:
DOMDocument::loadHTML() exists. However: for an example of how to load an HTML document or fragment in a specific encoding, see this related answer of mine:
It most likely will show you how you can load your HTML. It also explains this in more detail. Let me know if it doesn't solve your issue.
Upvotes: 1