Lincoln Bergeson

Reputation: 3451

How can I get a DOMElement's innerText in PHP?

We use a CMS on our site. Many users have added HTML content into the database that is formatted weirdly. For example, putting all their HTML on a single line:

<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>

This renders in the browser correctly, of course. However, I am writing a script in PHP that loads up this data into a DOMDocument like so:

$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump($doc->documentElement->textContent);

This shows up as:

This is my titleFirst paragraphSecond paragraph

How can I get documentElement to return innerText, rather than textContent? I believe innerText will return a string with line breaks.

Upvotes: 2

Views: 3885

Answers (2)

user7212232

Reputation: 115

The answer is the nodeValue property of each element:

 $arrDivs = $dom->getElementsByTagName('div'); 

 foreach($arrDivs as $div){
     $text = $div->nodeValue;
     echo $text . PHP_EOL . PHP_EOL;
 }
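
The question's sample markup has no div elements, so here is a minimal sketch of the same idea applied to it. It assumes the top-level elements of the fragment are the units that should end up on separate lines; the variable names are illustrative only.

```php
<?php
// Sample markup from the question; in the real script this would be $row['body_html'].
$html = '<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>, so the fragment's
// top-level elements are the children of <body>.
$body = $doc->getElementsByTagName('body')->item(0);

$lines = [];
foreach ($body->childNodes as $child) {
    // Only elements are collected; stray text nodes between them are skipped.
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $lines[] = trim($child->nodeValue);
    }
}

// One line per top-level element.
$text = implode(PHP_EOL, $lines);
echo $text, PHP_EOL;
```

This only separates top-level elements. Anything nested inside them (a list inside a div, for instance) still comes back as one run of text.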

Upvotes: 0

Lincoln Bergeson

Reputation: 3451

As cb0 said:

You should iterate over all elements in the DOMDocument, get the text item by item, and insert the whitespace manually. Have a look here for example. DOMDocument itself cannot know where it should put the whitespace.

I wrote the following function to recursively traverse the DOMDocument object:

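function get_text_from_dom($node, $text = '') {
  // Text nodes are the leaves: append their content plus a separating space.
  if ($node->nodeType === XML_TEXT_NODE) {
    return $text . $node->textContent . ' ';
  }
  // Otherwise recurse into the children, passing the accumulated text along.
  if ($node->hasChildNodes()) {
    foreach ($node->childNodes as $child) {
      $text = get_text_from_dom($child, $text);
    }
  }
  return $text;
}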

And replaced the code in the question with the following:

$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump(get_text_from_dom($doc->documentElement));

It is glorious.
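
If you want line breaks rather than spaces, which is closer to what innerText gives you in a browser, the same traversal can emit a newline after each block-level element. This is only a rough sketch: get_text_with_breaks is a hypothetical name, and the list of block tags is my own assumption, not anything DOMDocument knows about.

```php
<?php
// Hypothetical helper: walks the tree like get_text_from_dom(), but adds a
// line break after each block-level element instead of a space after each text node.
function get_text_with_breaks(DOMNode $node, array $blockTags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div', 'li', 'br']): string
{
    // Text nodes contribute their content unchanged.
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->textContent;
    }

    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text .= get_text_with_breaks($child, $blockTags);
        }
    }

    // The block-tag list is an assumption; adjust it to the markup in your CMS.
    if ($node->nodeType === XML_ELEMENT_NODE && in_array(strtolower($node->nodeName), $blockTags, true)) {
        $text .= PHP_EOL;
    }

    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>');
$text = get_text_with_breaks($doc->documentElement);
echo $text;
```

For the markup in the question this puts the title and each paragraph on its own line, with a trailing line break you can rtrim() away. It does not try to reproduce the browser's CSS-aware innerText rules.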

Upvotes: 1
