Reputation: 2148
I have a huge file with lots of entries that have one thing in common: the first line.
I want to extract all of the text from a paragraph where the first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in the paragraph node where the first line is 'Type of document: Contract Notice'.
I am sure there is a simple way of doing this using DOM methods or XPath. Please advise!
Upvotes: 0
Views: 104
Reputation: 272036
Speaking of XPath, try the following expression, which selects <p> elements whose first <b> child element has the value Type of document: and whose next sibling text node contains Contract Notice:
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
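To get from the expression to the paragraph text the question asks for, something along these lines should work. This is a minimal sketch, not the answerer's own code: the sample markup in $content is an assumption modeled on the question, and $paragraphs is just an illustrative variable name.

```php
<?php
// Sample markup modeled on the question (an assumption); real input would be
// the full document already held in $content.
$content = <<<HTML
<html><body>
<p><b>Other:</b> something else</p>
<p>
<b>Type of document:</b>
 Contract Notice <br>
<b>Country</b> <br>
... rest of text ...
</p>
</body></html>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($content);

$xpath = new DOMXPath($dom);
$query = '//p[b[1][.="Type of document:"]'
       . '/following-sibling::text()[1][contains(., "Contract Notice")]]';

$paragraphs = [];
foreach ($xpath->query($query) as $p) {
    // textContent concatenates every descendant text node of the <p>;
    // collapse whitespace runs so each paragraph becomes a single line.
    $paragraphs[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
}

print_r($paragraphs);
```

Each matching <p> contributes one string, so a file with many Contract Notice entries yields one array element per entry.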
Upvotes: 2
Reputation: 937
I don't like using DOMDocument parsing unless I need to heavily parse a document, but if you want to do so, it could be something like:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXpath($doc);
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach ($matchedDoms as $domMatch) {
    $data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a simple regex line to do it all, since it's just one piece of the document you are looking for:
//Using a Regular Expression
//Lazy quantifiers keep the match inside a single <p>...</p> block
if (preg_match('/<p>\s*<b>Type of document:<\/b>[^<]*Contract Notice(?<data>.*?)<\/p>/si', $content, $matches)) {
    var_dump($matches['data']); //If you want everything in there
    var_dump(strip_tags($matches['data'])); //If you just want the text
}
Upvotes: 0
Reputation: 13047
With this XPath expression, you select the text inside each child element of the p element (text placed directly in the p, such as Contract Notice, is not included):
//b[text()="Type of document:"]/parent::p/*/text()
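The difference between /*/text() and //text() is easy to miss, so here is a small comparison. It is only a sketch: the one-line sample markup is an assumption based on the question, and collectText is a hypothetical helper, not part of the DOM extension.

```php
<?php
// Sample markup modeled on the question (an assumption).
$content = '<p><b>Type of document:</b> Contract Notice <br>'
         . '<b>Country</b> <br>... rest of text ...</p>';

$dom = new DOMDocument;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// Hypothetical helper: run a query and return the trimmed, non-empty
// values of the text nodes it selects, in document order.
function collectText(DOMXPath $xpath, string $query): array
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $out[] = $text;
        }
    }
    return $out;
}

$base = '//b[text()="Type of document:"]/parent::p';

// Text nodes inside the child elements of <p> only (here, the two <b> tags).
$childText = collectText($xpath, $base . '/*/text()');

// Every text node anywhere below <p>, including text directly in <p>.
$allText = collectText($xpath, $base . '//text()');

print_r($childText);
print_r($allText);
```

If the goal is the whole paragraph, the //text() form (or textContent on the p node) is the one that includes Contract Notice and the trailing text.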
Upvotes: 0