djmzfKnm

Reputation: 27195

Extracting text from html?

I have a string as below

<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>

I want to extract the text from the HTML above as Hello World, this is StackOverflow's question details page. Notice that I want to remove the &nbsp; as well.

How can we achieve this in PHP? I tried a few functions (strip_tags, html_entity_decode, etc.), but they all fail under some conditions.

Please help, Thanks!

Edit: the code I am trying is below, but it's not working :( It leaves the &nbsp; and &#39; type of characters in place.

$TMP_DESCR = trim(strip_tags($rs['description']));

Upvotes: 0

Views: 546

Answers (4)

Aaron W.

Reputation: 9299

The code below worked for me, though I had to do a str_replace on the non-breaking space.

$string = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";
echo htmlspecialchars_decode(trim(strip_tags(str_replace('&nbsp;', '', $string))), ENT_QUOTES);

Upvotes: 1

lonesomeday

Reputation: 238035

Probably the nicest and most reliable way to do this is with genuine (X|HT)ML parsing functions like the DOMDocument class:

<?php

$str = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";

$dom = new DOMDocument;
$dom->loadXML(str_replace('&nbsp;', ' ', $str));

echo trim($dom->firstChild->nodeValue);
// "Hello World, this is StackOverflow's question details page"

This is probably slight overkill for this problem, but using the proper parsing functionality is a good habit to get into.


Edit: You can reuse the DOMDocument object, so you only need two lines within the loop:

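$dom = new DOMDocument;
while ($rs = mysql_fetch_assoc($result)) { // or whatever fetch call you use (mysql_* was removed in PHP 7)
    $dom->loadHTML(str_replace('&nbsp;', ' ', $rs['description']));
    // loadHTML() wraps the fragment in <html><body> and firstChild is the doctype,
    // so read the text from the root element instead
    $TMP_DESCR = trim($dom->documentElement->textContent);

    // do something with $TMP_DESCR
}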

Upvotes: 0

Rui Jiang

Reputation: 1672

First, call trim() on the HTML to remove the surrounding whitespace. http://php.net/manual/en/function.trim.php

Then strip_tags, then html_entity_decode.

So: html_entity_decode(strip_tags(trim($html)));
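
One catch with this order is that trim() runs before the tags are gone, and a decoded &nbsp; is not an ordinary space. A minimal sketch of the same steps reordered, assuming the input is UTF-8; the function name extract_text is made up for illustration and is not from any of the answers. ENT_QUOTES is passed explicitly so that &#39; is decoded regardless of the default flags of the PHP version in use.

```php
<?php
// Hypothetical helper (not from the original answers); assumes UTF-8 input.
function extract_text(string $html): string
{
    // Remove the markup first, then decode named and numeric entities.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0 (bytes C2 A0 in UTF-8), which trim() does not
    // strip by default, so turn it into an ordinary space before trimming.
    $text = str_replace("\xC2\xA0", ' ', $text);

    return trim($text);
}

echo extract_text('<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>');
// Hello World, this is StackOverflow's question details page
```

Replacing the non-breaking space after decoding, rather than removing the literal string '&nbsp;' beforehand, also covers the numeric form &#160;.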

Upvotes: 0

sevenseacat

Reputation: 25049

strip_tags() will get rid of the tags, and trim() should get rid of the whitespace. I'm not sure if it will work with non-breaking spaces though.
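
To check the non-breaking space doubt: with its default character list, trim() only strips ASCII whitespace and NUL bytes, so a decoded U+00A0 is left alone. A rough sketch, assuming UTF-8 strings; trim_with_nbsp is a hypothetical helper name, not a built-in.

```php
<?php
// U+00A0 (non-breaking space) encoded as UTF-8.
$nbsp = "\xC2\xA0";
$text = $nbsp . 'Hello World ';

// Default trim() removes the trailing space but keeps the leading NBSP.
var_dump(trim($text) === $nbsp . 'Hello World'); // bool(true)

// Hypothetical helper: a trim that also removes U+00A0 at either end.
function trim_with_nbsp(string $text): string
{
    // The /u modifier makes \x{00A0} match the code point, not single bytes.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text) ?? $text;
}

var_dump(trim_with_nbsp($text) === 'Hello World'); // bool(true)
```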

Upvotes: 0
