Dmitry Makovetskiyd

Reputation: 7053

Getting all the paragraphs in a string extract

I am taking a few paragraphs from a database and trying to separate them into an array using regex and various classes, but nothing works.

I tried to do this:

   public function get_first_para(){
       $doc = new DOMDocument();
       $doc->loadHTML($this->review);
       foreach($doc->getElementsByTagName('p') as $paragraph) {
           echo $paragraph."<br/><br/><br/>";
       }
   }

But I get this:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 9 in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 18

Catchable fatal error: Object of class DOMElement could not be converted to string in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 20

Why do I get these messages? Is there an easy way to extract all the paragraphs from a string?

UPDATE:

   public function get_first_para(){
         $pattern="/<p>(.+?)<\/p>/i";
         preg_match_all($pattern,$this->review,$matches,PREG_PATTERN_ORDER);
         return $matches;
     }

I would prefer the second approach, but it doesn't work well either.

Upvotes: 2

Views: 4693

Answers (1)

complex857

Reputation: 20753

DOMDocument::getElementsByTagName returns a DOMNodeList object, which is iterable but is not an array. Inside the foreach, the $paragraph variable is an instance of DOMElement, so simply using it as a string won't work (as the error explains).

What you want is the text content of the DOMElement, which is available through its textContent property (inherited from the DOMNode class):

foreach($doc->getElementsByTagName('p') as $paragraph) {
  echo $paragraph->textContent."<br/><br/><br/>"; // for text only
} 

Or, if you need the full content of the DOMNode, you can use DOMDocument::saveHTML (passing a node to it requires PHP 5.3.6 or later):

foreach($doc->getElementsByTagName('p') as $paragraph) {
    // with the containing <p> tag
    echo $doc->saveHTML($paragraph)."<br/><br/><br/>\n";

    // without the containing <p> tag:
    // iterate through the paragraph's child nodes and output each one
    foreach ($paragraph->childNodes as $cnode) {
        echo $doc->saveHTML($cnode);
    }
}
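
Since the question asks for the paragraphs as an array, the loops above can be folded into a small helper. This is only a minimal sketch: the function name extract_paragraphs and the sample input are assumptions for illustration, not part of the original code, and it returns the text of each paragraph rather than echoing it.

```php
<?php
// Hypothetical helper (not from the original post): returns the text
// content of every <p> element in an HTML fragment as a plain array.
function extract_paragraphs($html)
{
    // Collect parser warnings internally so invalid markup does not
    // emit PHP warnings; remember the previous setting to restore it.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = array();
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // textContent drops nested tags and keeps only the text.
        $paragraphs[] = trim($paragraph->textContent);
    }
    return $paragraphs;
}

// Sample input, made up for this example.
$sample = '<p>First paragraph.</p><p>Second <b>paragraph</b>.</p>';
print_r(extract_paragraphs($sample));
```

If the markup of each paragraph is needed instead of its text, the line that fills $paragraphs could use $doc->saveHTML($paragraph) in place of textContent.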

As for your loadHTML warning: the HTML input is invalid. You can suppress such warnings with:

libxml_use_internal_errors(true); // before loading the html content

If you need these errors, see the libxml error handling section of the PHP manual.
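
As a rough illustration of reading those errors instead of discarding them, the sketch below collects whatever the parser reports for a piece of markup. The function name collect_html_errors and the sample input are assumptions made for this example; the exact messages depend on the libxml version installed, so no particular output is promised.

```php
<?php
// Hypothetical helper (not from the original post): parses an HTML
// fragment and returns the parser messages as an array of strings.
function collect_html_errors($html)
{
    // Keep errors in an internal buffer instead of raising warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = array();
    // Each entry is a LibXMLError object with level, code, line and message.
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Sample input with a stray closing </p>, made up for this example.
foreach (collect_html_errors('<div><p>Some text</p></p></div>') as $message) {
    echo $message, "\n";
}
```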

Edit

Since you insist on regexps, here's how you could go about it:

preg_match_all('!<p>(.+?)</p>!sim',$html,$matches,PREG_PATTERN_ORDER);

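The pattern modifiers: s lets the . match line breaks, i makes the match case-insensitive, and m (multiline) makes ^ and $ match at line boundaries. Since this pattern has no anchors, m has no effect here and can be dropped.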
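
One limitation of the pattern above is that it only matches a bare <p> tag, so paragraphs with attributes such as class or style are skipped. A possible variant, shown here as a sketch with a made-up function name (match_paragraphs) and made-up sample input, allows optional attributes in the opening tag. It is still a regex and will not cope with every kind of malformed or nested markup.

```php
<?php
// Hypothetical helper (not from the original post): returns the inner
// content of each <p>...</p> pair found in the string.
function match_paragraphs($html)
{
    // <p\b    the tag name followed by a word boundary, so <pre> is not matched
    // [^>]*   optional attributes inside the opening tag
    // (.*?)   the paragraph content, matched lazily
    // s: . also matches line breaks; i: case-insensitive tag names
    $pattern = '~<p\b[^>]*>(.*?)</p>~is';
    preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);

    // With PREG_PATTERN_ORDER, $matches[1] holds the first capture group
    // of every match.
    return $matches[1];
}

// Sample input, made up for this example.
$html = "<p class=\"intro\">One</p>\n<P>Two\nlines</P><pre>not a paragraph</pre>";
print_r(match_paragraphs($html));
```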

Upvotes: 4
