daktau

Reputation: 643

PHP and XPath queries

I need to extract some values, along with some raw HTML, from an HTML document. I thought of using XPath, but I cannot get my queries to work.

Here is the HTML I am working with:

<div class="unit-id">
   <div class="title">
      some title-1
   </div>

   <div class="another-class">
      another class
   </div>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <ul>
     <li>jfjfj</li>
     <li>jfjfj</li>
     <li>jfjfj</li>
   </ul>
</div>


<div class="unit-id">
   <div class="title">
      some title-2
   </div>
   <div class="another-class">
      some other class
   </div>
   <p>segwegw2<p>
   <p>segwegw2<p>
   <p>segwegw2<p>
   <p>segwegw2<p>
</div>


<div class="unit-id">
   <div class="title">
      some title-3
   </div>
   <div class="some-other-class">
      some other data
   </div>
   <p>segwegw3<p>
   <p>segwegw3<p>
   <p>segwegw3<p>
   <p>segwegw3<p>
</div>

I'd like the query to iterate over each div with the class unit-id and, for each one, return the text of its div with the class title plus the rest of its HTML, excluding any other divs (so just the p and ul elements belonging to that unit-id div), before moving on to the next one.

Is that possible? Could you provide me with an example of how to write this query? Is there a better way to do it?

Upvotes: 1

Views: 1109

Answers (1)

Expedito

Reputation: 7795

This code does something like what you're looking for:

function get_content($data){
    $doc = new DOMDocument();
    //load HTML string into document object
    if ( ! @$doc->loadHTML($data)){
        return FALSE;
    }
    //create XPath object using the document object as the parameter
    $xpath = new DOMXPath($doc);
    $query = "//div[@class='unit-id']";
    //XPath queries return a NodeList
    $res = $xpath->query($query);
    $out = array();
    foreach ($res as $key => $node){
        //subquery
        $sub = $xpath->query('.//div[@class="title"]', $node);
        //guard against a unit that has no title div
        $title = $sub->item(0);
        $out[$key]['title'] = $title ? trim($title->nodeValue) : '';
        foreach ($node->getElementsByTagName('p') as $key2 => $value){
            $out[$key]['par'][$key2] = $value->nodeValue;
        }
        foreach ($node->getElementsByTagName('li') as $key2 => $value){
            $out[$key]['list'][$key2] = $value->nodeValue;
        }
    }
    return $out;
}

Please note that your HTML contains errors: the closing paragraph tags are missing their slash and should be written as </p>.

Here's the output:

array
  0 => 
    array
      'title' => string 'some title-1' (length=12)
      'par' => 
        array
          0 => string 'segwegw1' (length=8)
          1 => string 'segwegw1' (length=8)
          2 => string 'segwegw1' (length=8)
          3 => string 'segwegw1' (length=8)
      'list' => 
        array
          0 => string 'jfjfj' (length=5)
          1 => string 'jfjfj' (length=5)
          2 => string 'jfjfj' (length=5)
  1 => 
    array
      'title' => string 'some title-2' (length=12)
      'par' => 
        array
          0 => string 'segwegw2' (length=8)
          1 => string 'segwegw2' (length=8)
          2 => string 'segwegw2' (length=8)
          3 => string 'segwegw2' (length=8)
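
The output above holds text values only. The question also asks for the remaining HTML of each unit without its nested divs, so here is a minimal sketch of one way to collect that. It assumes the class attributes match exactly (no extra class names) and that the p and ul elements are direct children of the unit-id div. The names get_units and $sample are hypothetical, and the sample input uses properly closed p tags.

```php
<?php
// Sketch: collect each unit's title plus the raw HTML of its non-div children.
// get_units() is a hypothetical helper name used only for this illustration.
function get_units(string $data): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($data);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $units = [];
    foreach ($xpath->query("//div[@class='unit-id']") as $unit) {
        // Title div that is a direct child of this unit (may be absent).
        $titleNode = $xpath->query("./div[@class='title']", $unit)->item(0);
        $rest = '';
        // Direct element children that are not divs (p, ul, ...),
        // serialized back to markup with saveHTML().
        foreach ($xpath->query('./*[not(self::div)]', $unit) as $child) {
            $rest .= $doc->saveHTML($child);
        }
        $units[] = [
            'title' => $titleNode ? trim($titleNode->nodeValue) : '',
            'html'  => $rest,
        ];
    }
    return $units;
}

// Made-up sample input for illustration only.
$sample = '<div class="unit-id"><div class="title"> some title-1 </div>'
        . '<div class="another-class">another class</div>'
        . '<p>segwegw1</p><ul><li>jfjfj</li></ul></div>';
$units = get_units($sample);
```

Each entry of $units then carries a 'title' string and an 'html' string containing the serialized p and ul elements of that unit, while the content of the nested divs is left out.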

Upvotes: 3
