Reputation: 26771
I'm parsing a third-party web page using PHP's DOMElement class. When I open the page in my browser and view the source, it's clean, but when I read some of the nodes through the DOMElement->nodeValue property, the HTML tags aren't there, and there are several newlines and this character: Â. According to this answer, this is the character that shows up when there's an encoding issue.
I also get that gobbledygook using:
My question: how can I simply get the clean HTML code inside the DOMElement?
Here is the clean HTML code:
<b>Author:</b> AUTHOR<br>
<b>ISBN:</b> 9780684857220 <br>
<b>Edition/Copyright:</b> 7<br>
<b>Publisher:</b> J+M<br>
<b>Published Date:</b> 1989<br>
Here is what nodeValue gives:
Â
Author:Â AUTHOR ISBN:Â 9780684857220 Edition/Copyright:Â 7 Publisher:Â J+M Published Date:Â
1989
Upvotes: 1
Views: 840
Reputation: 26771
Turns out it wasn't an encoding issue; I was using the wrong methods. nodeValue returns only the text content of a node, so the tags are dropped. This works:
// Copy the node (with its children) into a fresh document, then serialize it as HTML.
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($second_td, true));
echo $doc->saveHTML();
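For comparison, here is a minimal sketch of the difference between nodeValue and serializing the node. The sample markup and the variable names are made up for illustration and are not taken from the original page; passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical sample markup standing in for the third-party page.
$html = '<table><tr><td><b>Author:</b> AUTHOR<br><b>ISBN:</b> 9780684857220<br></td></tr></table>';

$page = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$page->loadHTML($html);
libxml_clear_errors();

$td = $page->getElementsByTagName('td')->item(0);

// nodeValue is only the concatenated text of the node; the tags are gone.
$text = $td->nodeValue;

// saveHTML($node) serializes the node itself, markup included.
$markup = $page->saveHTML($td);

echo $text, "\n";
echo $markup, "\n";
```

If the installed PHP version supports it, this avoids building a second document just to print one node.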
Upvotes: 2
Reputation: 3284
Have you tried specifying the encoding when you create the DOM document? For example:
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadXML($third_party_web_page_string);
or
$doc = new DOMDocument('1.0', 'iso-8859-1');
$doc->loadXML($third_party_web_page_string);
If neither of those works, you could try running the data through the iconv function before you load it into the DOM object.
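A rough sketch of the iconv approach, assuming the page really is ISO-8859-1 (check the Content-Type header or meta tag of the actual page; the sample string here is invented). The XML encoding prefix is a commonly used workaround to make loadHTML() read the input as UTF-8, not something the page itself supplies.

```php
<?php
// Hypothetical ISO-8859-1 input: "\xA0" is a non-breaking space in that charset.
$latin1 = "<p><b>Author:</b>\xA0AUTHOR</p>";

// Convert to UTF-8 before handing the markup to the DOM.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// The encoding prefix hints to libxml that the string is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $utf8);
libxml_clear_errors();

$p = $doc->getElementsByTagName('p')->item(0);
// DOM strings come back as UTF-8, so the page that prints them
// should declare a matching charset to the browser.
echo $p->nodeValue, "\n";
```

If the output still shows Â in the browser, the response is probably being displayed as ISO-8859-1 while the bytes are UTF-8, so the charset of the output page is worth checking as well.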
Upvotes: 2