James Dawson

Reputation: 5409

Converting special HTML characters back into their original strings

I'm building a small parser that scrapes web pages and logs the data on them. One of the things to log is the title of forum posts. I'm using an XML parser to look through the DOM and get this information, and I'm storing it like this:

// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));

This works for the most part, but some posts contain special HTML character codes like &#8211;, which is a dash (–). How would I go about converting these special character codes back into their original characters?

Thanks.

Upvotes: 1

Views: 2332

Answers (3)

mynewaccount

Reputation: 446

Use html_entity_decode. Here's a quick example.

$string = "hyphenated&#8211;words";

$new = html_entity_decode($string);

echo $new;

You should see...

hyphenated–words

Upvotes: 3

DLL

Reputation: 541

This might help:

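<?php
// Note: this converts characters into entities; it does not decode them.
function clean_up($str) {
    // Remove backslash escaping.
    $str = stripslashes($str);
    // Replace characters that have named HTML entities with those entities.
    $str = strtr($str, get_html_translation_table(HTML_ENTITIES));
    // Replace Windows-1252 punctuation bytes with numeric character references.
    $str = str_replace(
        array("\x82", "\x84", "\x85", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97"),
        array("&#8218;", "&#8222;", "&#8230;", "&#8216;", "&#8217;", "&#8220;", "&#8221;", "&#8226;", "&#8211;", "&#8212;"),
        $str
    );
    return $str;
}
?>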

Upvotes: 0

Peter

Reputation: 16933

Documentation is your friend:

html_entity_decode(trim($title->plaintext), ENT_XHTML, YOUR_ENCODING);
                                            ^^^^^^^^^^^^^^^^^^^^^^^^
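
To make the call above concrete, here is a minimal sketch that assumes the scraped pages are UTF-8 and that PHP 7 or later is available. The helper name decode_title is hypothetical (it is not from the question), and the ENT_QUOTES | ENT_HTML5 flags are one reasonable choice rather than the only one; swap in ENT_XHTML and your own encoding if the pages call for it.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original post.
function decode_title(string $raw): string
{
    // ENT_QUOTES decodes both &quot; and &#039;; ENT_HTML5 selects the HTML5 entity table.
    // 'UTF-8' is an assumption about the encoding you want the result in.
    return html_entity_decode(trim($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// prints: Tips – Tricks & more
echo decode_title("  Tips &#8211; Tricks &amp; more  "), "\n";
```

Passing the encoding explicitly matters because the target character set has to be able to represent the decoded character; an en dash cannot be written in ISO-8859-1, for example, so the entity would be left untouched there.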

Upvotes: 0
