Extract text from HTML <p> with a particular title

I have a huge file with lots of entries. They have one thing in common: the first line.
I want to extract all of the text from a paragraph where the first line is:

Type of document: Contract Notice

The HTML code I am working on is here:

<!-- other HTML -->
    <p>
      <b>Type of document:</b>
      " Contract Notice" <br>
      <b>Country</b> <br>
      ... rest of text ...
    </p>
<!-- other HTML -->

I have put the HTML into a DOM like this:

$dom = new DOMDocument;    
$dom->loadHTML($content);

I need to return all of the text in the paragraph node whose first line is 'Type of document: Contract Notice'.
I am sure there is a simple way of doing this using DOM methods or XPath. Please advise!

Upvotes: 0

Views: 104

Answers (3)

Salman Arshad

Reputation: 272036

Speaking of XPath, try the following expression, which selects <p> elements:

  • whose first <b> child element has the value Type of document:
  • whose first following-sibling text node (after that <b>) contains the text Contract Notice

//p[
    b[1][.="Type of document:"]
        /following-sibling::text()[1][contains(., "Contract Notice")]
]
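
As a minimal sketch of how the expression above might be run from PHP: the sample markup below is modelled on the question, and the second "Award Notice" paragraph is a made-up entry added only to show that non-matching paragraphs are skipped. The variable names ($expression, $paragraphs) are illustrative.

```php
<?php
// Sample markup modelled on the question; the second <p> is a hypothetical non-matching entry.
$content = <<<'HTML'
<html><body>
<p>
  <b>Type of document:</b>
  Contract Notice <br>
  <b>Country</b> <br>
  ... rest of text ...
</p>
<p>
  <b>Type of document:</b>
  Award Notice <br>
  <b>Country</b> <br>
  ... other text ...
</p>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// The expression from this answer, split over two lines for readability.
$expression = '//p[b[1][.="Type of document:"]'
    . '/following-sibling::text()[1][contains(., "Contract Notice")]]';

$paragraphs = [];
foreach ($xpath->query($expression) as $p) {
    // textContent concatenates all descendant text nodes; collapse whitespace runs to single spaces.
    $paragraphs[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
}

print_r($paragraphs);
```

Because the query returns the <p> elements themselves, textContent gives the whole paragraph text in one go, without looping over individual text nodes.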

Upvotes: 2

Ayman Bedair

Reputation: 937

I don't like using DOMDocument unless I need to parse a document heavily, but if you want to do so, it could be something like:

//Using DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
//Every text node inside any <p> that has a <b>Type of document:</b> child (this does not filter on "Contract Notice")
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';

foreach ($matchedDoms as $domMatch) {
    $data .= $domMatch->data . ' ';
}

var_dump($data);

I would prefer a simple regex line to do it all. After all, it's just one piece of the document you are looking for:

//Using a Regular Expression
//Lazy .*? and [^<]* keep the match inside one <p>; a greedy .* would run on to the last </p> in the file
if (preg_match('/<p>\s*<b>Type of document:<\/b>[^<]*Contract Notice(?<data>.*?)<\/p>/si', $content, $matches)) {
    var_dump($matches['data']); //If you want everything in there
    var_dump(strip_tags($matches['data'])); //If you just want the text
}

Upvotes: 0

Moritz Petersen

Reputation: 13047

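With this XPath expression, you select the text nodes of the child elements of the p element (here, the contents of the <b> tags). Text that sits directly inside the p itself, such as Contract Notice, is not matched by /*/text():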

//b[text()="Type of document:"]/parent::p/*/text()
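
To see what the expression above returns, here is a small sketch that runs it against markup modelled on the question and compares it with a //text() variant, which also reaches the paragraph's own text nodes. The helper collectText() and the variable names are hypothetical, introduced only for this illustration.

```php
<?php
// Sample markup modelled on the question.
$content = <<<'HTML'
<p>
  <b>Type of document:</b>
  Contract Notice <br>
  <b>Country</b> <br>
  ... rest of text ...
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// Hypothetical helper: run an XPath query and return the non-empty, trimmed node values.
function collectText(DOMXPath $xpath, string $expression): array
{
    $parts = [];
    foreach ($xpath->query($expression) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return $parts;
}

$base = '//b[text()="Type of document:"]/parent::p';

// Text nodes of the child elements only (the <b> labels).
$childElementText = collectText($xpath, $base . '/*/text()');

// Every text node anywhere under the paragraph, including its direct text.
$allText = collectText($xpath, $base . '//text()');

print_r($childElementText);
print_r($allText);
```

On this sample, the first list holds only the two <b> labels, while the second also includes Contract Notice and the remaining paragraph text.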

Upvotes: 0
