Reputation: 2143
I'm trying to scrape a web page for content, using file_get_contents to grab the HTML and then loading it into a DOMDocument object. My problem is that I cannot get the appropriate information. I'm not sure if this is because I'm using DOMDocument's methods wrong, or if the (X)HTML in my source is just poor.
In the source, there is an element with an id of 'cards', which has two child divs. I want the first child, which has many child divs, each of which in turn has an anchor child with a div child. I want the href from the anchor and the nodeValue from its child div.
The structure is like this:
<div id="cards">
    <div class="grid">
        <div class="card-wrap">
            <a href="linkValue">
                <img src="..."/>
                <div>nameValue</div>
            </a>
        </div>
        ...
    </div>
    <div id="...">
    </div>
</div>
I've started out with $cards = $dom->getElementById("cards"). Looking at its childNodes, I get a DOMText object, a DOMElement object, a DOMText object, a DOMElement object, and a DOMText object. I then use $grid = $cards->childNodes->item(1) to get the first DOMElement object, which is presumably the .grid element. However, when I then iterate through $grid with:
foreach($grid->childNodes as $item){
    if($item->nodeName == "div"){
        echo $item->nodeName,' | ',$item->nodeValue,'<br>';
    }
}
I end up with a page full of "div | nameValue", where nameValue is the embedded div's nodeValue, and I am unable to locate the anchors to get their href values.
Am I doing something obviously wrong with my DOMDocument, or is there something more going on here?
Upvotes: 3
Views: 1142
Reputation: 3701
The XPath way:
$src = <<<EOS
<div id="cards">
<div class="grid">
<div class="card-wrap">
<a href="linkValue">
<img src="..."/>
<div>nameValue</div>
</a>
</div>
</div>
<div id="whatever">
</div>
</div>
EOS;
$xml = new SimpleXMLElement($src);
list ($anchor) = $xml->xpath('//div[@id="cards"]/div[1]/div[1]/a');
echo $anchor->div, ' => ', $anchor['href'], PHP_EOL;
In words, the XPath expression reads: "get the anchor of the first child div of the first child div of the div with an id of 'cards'".
Output:
nameValue => linkValue
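The snippet above only grabs the first card. Since the question asks for every card in the grid, here is a minimal sketch of the same idea with the positional filter dropped from the last div step. The second card and its values (otherLink, otherName) are made up for illustration, and note that SimpleXMLElement needs well-formed XML, so this assumes the scraped markup parses as XML.

```php
<?php
// Sample markup with two cards; the second card's values are invented for this sketch.
$src = <<<EOS
<div id="cards">
  <div class="grid">
    <div class="card-wrap"><a href="linkValue"><img src="..."/><div>nameValue</div></a></div>
    <div class="card-wrap"><a href="otherLink"><img src="..."/><div>otherName</div></a></div>
  </div>
  <div id="whatever"></div>
</div>
EOS;

$xml = new SimpleXMLElement($src);

// div[1] still selects the grid, but the next step matches every
// card-wrap div instead of only the first one.
$cards = [];
foreach ($xml->xpath('//div[@id="cards"]/div[1]/div/a') as $anchor) {
    // Cast to string: SimpleXML returns element/attribute objects.
    $cards[(string) $anchor->div] = (string) $anchor['href'];
}

foreach ($cards as $name => $href) {
    echo $name, ' => ', $href, PHP_EOL;
}
```

Keying the array by name is just a convenience for the example; if names can repeat, collecting name/href pairs in a plain list would be safer.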
Upvotes: 0
Reputation: 3457
Well, in your example code, the check if($item->nodeName == "div"){ is going to exclude any <a> tag. Additionally, I do not believe childNodes iterates recursively; it only lists an element's direct children.
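A small sketch of that point, using made-up sample markup shaped like the question's (in the real script, $html would come from file_get_contents): the grid's childNodes only contains the card-wrap div, whereas getElementsByTagName searches all descendants and does find the anchor.

```php
<?php
// Hypothetical sample markup mirroring the structure in the question.
$html = '<div id="cards"><div class="grid">'
      . '<div class="card-wrap"><a href="linkValue"><img src="..."/><div>nameValue</div></a></div>'
      . '</div><div id="other"></div></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Pick the first element child of #cards, skipping any whitespace text nodes.
$grid = null;
foreach ($dom->getElementById('cards')->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $grid = $node;
        break;
    }
}

// childNodes: direct children only, so the anchor is not in this list.
$direct = [];
foreach ($grid->childNodes as $child) {
    $direct[] = $child->nodeName;
}

// getElementsByTagName: searches every descendant of $grid.
$links = [];
foreach ($grid->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($direct);
print_r($links);
```

This also explains the "div | nameValue" output: nodeValue on the card-wrap div is the concatenated text of everything beneath it, including the inner div.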
Therefore, to access the nodes in question, you could use:
$children = $dom->getElementById("cards")->childNodes
    ->item(1)->childNodes->item(1)->childNodes;
Yet, as you can see, this is very messy. XPath is a much cleaner way to get at these nodes.
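A minimal sketch of the XPath route with DOMDocument, assuming the page has the structure shown in the question. The sample markup and the $results array are just for illustration; in the real script $html would be whatever file_get_contents returned.

```php
<?php
// Hypothetical sample markup mirroring the structure in the question.
$html = '<div id="cards"><div class="grid">'
      . '<div class="card-wrap"><a href="linkValue"><img src="..."/><div>nameValue</div></a></div>'
      . '</div><div id="other"></div></div>';

libxml_use_internal_errors(true); // real-world HTML tends to trigger parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$results = [];
// Every anchor inside a child div of the first child div of #cards.
foreach ($xpath->query('//div[@id="cards"]/div[1]/div/a') as $a) {
    // Relative query: the div that is a direct child of this anchor.
    $nameDiv = $xpath->query('./div', $a)->item(0);
    $results[] = [
        'href' => $a->getAttribute('href'),
        'name' => $nameDiv !== null ? trim($nameDiv->nodeValue) : null,
    ];
}

foreach ($results as $row) {
    echo $row['name'], ' => ', $row['href'], PHP_EOL;
}
```

Because the query names elements rather than positions in childNodes, the whitespace text nodes between the tags no longer matter.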
Upvotes: 3