Reputation: 3451
We use a CMS on our site. Many users have added HTML content into the database that is formatted weirdly. For example, putting all their HTML on a single line:
<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>
This renders in the browser correctly, of course. However, I am writing a script in PHP that loads up this data into a DOMDocument like so:
$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump($doc->documentElement->textContent);
This shows up as:
This is my titleFirst paragraphSecond paragraph
How can I get documentElement to return innerText, rather than textContent? I believe innerText will return a string with line breaks.
Upvotes: 2
Views: 3885
Reputation: 115
The answer is nodeValue:
$arrDivs = $dom->getElementsByTagName('div');
foreach ($arrDivs as $div) {
    $text = $div->nodeValue;
    echo $text . PHP_EOL . PHP_EOL;
}
Upvotes: 0
Reputation: 3451
As cb0 said:
You should iterate over all elements in the DomDocument and get the text item by item and insert the whitespace manually. Have a look here for example. DomDocument itself cannot know where it should put the whitespace.
I wrote the following function to recursively traverse the DOMDocument object:
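function get_text_from_dom($node, $text = '') {
    // Text nodes are leaves: append their content plus a separating space.
    if ($node instanceof DOMText) {
        return $text . $node->textContent . ' ';
    }
    // Other nodes contribute only through their children.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text = get_text_from_dom($child, $text);
        }
    }
    return $text;
}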
And replaced the code in the question with the following:
$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump(get_text_from_dom($doc->documentElement));
It is glorious.
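If line breaks rather than spaces are wanted, as the question originally asked, a variation along these lines might work. This is only a sketch: get_text_with_breaks and its $blockTags list are hypothetical names of my own, not part of DOMDocument, and the tag list is just an assumption about which elements should end a line.

```php
<?php
// Hypothetical helper (not a DOMDocument API): collects the text under $node
// and appends a newline after each element whose tag is listed in $blockTags.
function get_text_with_breaks(
    DOMNode $node,
    array $blockTags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div', 'li', 'br']
): string {
    // Text nodes are leaves: return their content unchanged.
    if ($node instanceof DOMText) {
        return $node->textContent;
    }

    // Concatenate the text of all children, depth first.
    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text .= get_text_with_breaks($child, $blockTags);
        }
    }

    // Assumed rule: a block-level element ends the current line.
    if ($node instanceof DOMElement
        && in_array(strtolower($node->nodeName), $blockTags, true)) {
        $text .= "\n";
    }

    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>');
echo get_text_with_breaks($doc->documentElement);
```

For the sample HTML in the question, this should print the title and the two paragraphs on separate lines. It does not try to reproduce the browser's innerText rules exactly; for instance, it would still include the contents of script or style elements.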
Upvotes: 1