Reputation: 643
I need to extract some values and also some raw HTML from an HTML document. I thought of using XPath, but I cannot get my queries to work.
Here is what I want to achieve:
<div class="unit-id">
<div class="title">
some title-1
</div>
<div class="another-class">
another class
</div>
<p>segwegw1<p>
<p>segwegw1<p>
<p>segwegw1<p>
<p>segwegw1<p>
<ul>
<li>jfjfj</li>
<li>jfjfj</li>
<li>jfjfj</li>
</ul>
</div>
<div class="unit-id">
<div class="title">
some title-2
</div>
<div class="another-class">
some other class
</div>
<p>segwegw2<p>
<p>segwegw2<p>
<p>segwegw2<p>
<p>segwegw2<p>
</div>
<div class="unit-id">
<div class="title">
some title-3
</div>
<div class="some-other-class">
some other data
</div>
<p>segwegw3<p>
<p>segwegw3<p>
<p>segwegw3<p>
<p>segwegw3<p>
</div>
So I'd like the query to iterate through each div with a unit-id class and return the value of the divs with a class of title and the rest of the HTML, excluding any more divs (so just the p tags and ul stuff for that particular unit-id classed div), and then move on to the next iteration.
Is that possible? Could you provide me with an example of how to write this query? Is there a better way to do it?
Upvotes: 1
Views: 1109
Reputation: 7795
This code does something like what you're looking for:
function get_content($data){
    $doc = new DOMDocument();
    //load HTML string into document object
    if ( ! @$doc->loadHTML($data)){
        return FALSE;
    }
    //create XPath object using the document object as the parameter
    $xpath = new DOMXPath($doc);
    $query = "//div[@class='unit-id']";
    //XPath queries return a NodeList
    $res = $xpath->query($query);
    $out = array();
    foreach ($res as $key => $node){
        //subquery, evaluated relative to the current unit-id div
        $sub = $xpath->query('.//div[@class="title"]', $node);
        //guard against a unit-id div that has no title div
        $out[$key]['title'] = $sub->length ? trim($sub->item(0)->nodeValue) : NULL;
        //text of every paragraph inside this unit
        foreach ($node->getElementsByTagName('p') as $key2 => $value){
            $out[$key]['par'][$key2] = $value->nodeValue;
        }
        //text of every list item inside this unit
        foreach ($node->getElementsByTagName('li') as $key2 => $value){
            $out[$key]['list'][$key2] = $value->nodeValue;
        }
    }
    return $out;
}
Please note that you have errors in your HTML. Your closing paragraph tags need a slash: </p>.
Here's the output:
array
  0 =>
    array
      'title' => string 'some title-1' (length=12)
      'par' =>
        array
          0 => string 'segwegw1' (length=8)
          1 => string 'segwegw1' (length=8)
          2 => string 'segwegw1' (length=8)
          3 => string 'segwegw1' (length=8)
      'list' =>
        array
          0 => string 'jfjfj' (length=5)
          1 => string 'jfjfj' (length=5)
          2 => string 'jfjfj' (length=5)
  1 =>
    array
      'title' => string 'some title-2' (length=12)
      'par' =>
        array
          0 => string 'segwegw2' (length=8)
          1 => string 'segwegw2' (length=8)
          2 => string 'segwegw2' (length=8)
          3 => string 'segwegw2' (length=8)
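If you want the raw HTML of the remaining elements rather than just their text, something along these lines might work. This is only a sketch: the function name get_units_with_html and the 'html' key are made up for illustration, and it assumes the title div and the p/ul elements are direct children of each unit-id div, as in your sample. It also matches the class with contains(...) so a div carrying extra classes would still be picked up.

```php
<?php
// Hypothetical helper (not part of the code above): returns each unit's title
// plus the serialized HTML of its direct children that are not divs.
function get_units_with_html(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    // Match divs whose class list contains the token "unit-id"
    $units = $xpath->query(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' unit-id ')]"
    );

    $out = [];
    foreach ($units as $unit) {
        // Title div, looked up as a direct child of the current unit
        $titleNode = $xpath->query(
            "./div[contains(concat(' ', normalize-space(@class), ' '), ' title ')]",
            $unit
        )->item(0);

        $body = '';
        // Every direct child element that is not a div (p, ul, ...)
        foreach ($xpath->query('./*[not(self::div)]', $unit) as $child) {
            // saveHTML() with a node argument serializes only that node
            $body .= $doc->saveHTML($child);
        }

        $out[] = [
            'title' => $titleNode ? trim($titleNode->textContent) : null,
            'html'  => $body,
        ];
    }
    return $out;
}
```

Calling get_units_with_html($data) should give you one array per unit-id div, with the p and ul markup concatenated in 'html'. The serializer may add line breaks between elements, so don't rely on the exact whitespace.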
Upvotes: 3