Reputation: 2963
I'm trying to write a robot that will fetch and parse HTML daily. For parsing the HTML I could just use string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can make a configuration of all the sites I have to spider and the tags I have to strip out, like:
'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'
So the code looks like this
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');
foreach ($tags as $tag)
echo $tag->nodeValue . "\n";
So with this I get all the div tags with class "articleDesc", which is great. But I noticed that all the HTML tags inside the div are stripped out. I wonder how I would get the whole contents of the div I'm looking at.
I also find it hard to find proper documentation for $xpath->query() to see how to form the query string. The PHP site doesn't say much about its exact format. Still, my main problem i
Upvotes: 0
Views: 5122
Reputation: 55002
The simple answer is:
foreach ($tags as $tag)
echo $dom->saveXML($tag);
If you want the a tags with their HTML unstripped, the XPath would be
//a[@class="articleDesc"]
That's assuming the a tags have that class attribute.
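Put together, a minimal runnable sketch of this answer (the sample HTML and class name are made up for illustration):

```php
<?php
// Sample document standing in for a fetched page.
$html = '<html><body><div class="articleDesc">Read <a href="/a1">more</a> here</div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//body/div[@class="articleDesc"]') as $tag) {
    // saveXML() on a single node serializes the node and everything
    // inside it, so the inner <a> tag survives instead of being
    // flattened to plain text the way nodeValue does.
    echo $dom->saveXML($tag), "\n";
}
```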
Upvotes: 2
Reputation: 1672
This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml() to bring it back into DOM.
$xml=simplexml_load_string($html);
$tags=$xml->xpath('//body/div[@class="articleDesc"]');
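A sketch of that round trip (the sample markup is invented; note that simplexml_load_string() needs well-formed XML, so it can choke on real-world HTML where DOMDocument::loadHTML() would not):

```php
<?php
// Well-formed sample; simplexml_load_string() would fail on sloppy HTML.
$html = '<body><div class="articleDesc">See <a href="/x">link</a></div></body>';

$xml  = simplexml_load_string($html);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // dom_import_simplexml() exposes the same node as a DOMElement,
    // so the DOM serializers become available again.
    $node = dom_import_simplexml($tag);
    echo $node->ownerDocument->saveXML($node), "\n";
}
```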
Upvotes: 0
Reputation: 4966
Try using SimpleXMLElement::asXML(): http://www.php.net/manual/en/simplexmlelement.asxml.php
Or, alternative:
function getNodeInnerHTML(DOMNode $oNode) {
    $oDom = new DOMDocument();
    // childNodes (plural); importNode(..., true) deep-copies each child
    foreach ($oNode->childNodes as $oChild) {
        $oDom->appendChild($oDom->importNode($oChild, true));
    }
    return $oDom->saveHTML();
}
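For what it's worth, a self-contained variant of the same idea (the helper name and sample markup are mine, not from the question): on PHP >= 5.3.6, saveHTML() accepts a node argument, which avoids importing children into a second document:

```php
<?php
// Inner-HTML helper: serialize each child of the node in place.
function getInnerHTML(DOMNode $node) {
    $html = '';
    foreach ($node->childNodes as $child) { // childNodes, plural
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
@$dom->loadHTML('<div class="articleDesc">Hi <a href="/y">there</a></div>');
$div = $dom->getElementsByTagName('div')->item(0);
echo getInnerHTML($div), "\n"; // Hi <a href="/y">there</a>
```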
Upvotes: 1