Reputation: 525
I'm trying to parse a document with DOMDocument and have run into a serious problem. I wrote a script that runs fine on PHP 5.2.9, extracting content via DOMNode::nodeValue. The same script fails to get any content on PHP 5.3.3, even though it correctly navigates to the proper nodes.
Basically, the code used looks like this:
$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query($query);
$value = $nodelist->item(0)->nodeValue;
I've checked that item(0) is in fact a node: it exists and is of the right type, but its nodeValue is empty.
On 5.3.3 the script works on some documents but not others; on 5.2.9 it works on all documents and returns the proper nodeValue.
Upvotes: 3
Views: 1286
Reputation: 525
I seem to have missed something basic, or hit a bug (though whether the bug is in PHP or libxml, I don't know). The issue is fixed by making sure the data passed to loadHTML is UTF-8 encoded. Mind you, the whole document didn't need re-encoding: the problem was a character in the element that wasn't in UTF-8. That then threw off the rest of the document handling.
What gets me is that all document content was thrown out while the structure stayed in place and worked normally. There were no errors or anything else to suggest the content was seen as invalid.
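For illustration, here is a minimal sketch of one way to make sure the input is valid UTF-8 before handing it to loadHTML. It is not the exact code used for the fix: the helper name ensureValidUtf8 and the sample markup are hypothetical, and it assumes the mbstring extension is available and that the document is meant to be UTF-8 with only a few stray bytes. If the real source encoding is known, converting from that encoding would preserve the characters instead of substituting them.

```php
<?php
// Hypothetical helper (not from the original script): returns $data with any
// byte sequences that are invalid in UTF-8 replaced by mbstring's substitute
// character. Valid input is returned unchanged.
function ensureValidUtf8($data)
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // Converting from UTF-8 to UTF-8 makes mbstring replace invalid sequences.
    // If the actual source encoding is known, pass it as the third argument.
    return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
}

// Sample document (made up for this sketch): declared as UTF-8, but the
// paragraph contains a lone 0xE9 byte, which is not valid UTF-8.
$data = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head>"
      . "<body><p id=\"t\">caf\xE9 ok</p></body></html>";

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML(ensureValidUtf8($data));
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$value = $xpath->query('//p[@id="t"]')->item(0)->nodeValue;
```

Running mb_check_encoding on the raw input before parsing is also a cheap way to find out which documents are affected, since the parser itself may not report anything.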
Upvotes: 2