Reputation: 5409
I'm building a small parser that scrapes web pages and logs the data on them. One of the things to log is the title of forum posts. I'm using an XML parser to look through the DOM and get this information, and I'm storing it like this:
// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));
This works for the most part, but some posts contain special HTML character codes like &#8211;, which is a dash (-). How would I go about converting these special character codes back into their original characters?
Thanks.
Upvotes: 1
Views: 2332
Reputation: 446
Use html_entity_decode. Here's a quick example.
$string = "hyphenated&#8211;words";
$new = html_entity_decode($string);
echo $new;
You should see...
hyphenated–words
Upvotes: 3
Reputation: 541
This might help:
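<?php
function clean_up($str) {
    // Remove backslash escapes.
    $str = stripslashes($str);
    // Replace each character that has a named HTML entity with that entity.
    $str = strtr($str, get_html_translation_table(HTML_ENTITIES));
    // Map Windows-1252 punctuation bytes to their typographic characters.
    $str = str_replace(array("\x82", "\x84", "\x85", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97"), array("‚", "„", "…", "‘", "’", "“", "”", "•", "–", "—"), $str);
    return $str;
}
?>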
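A word of caution on the byte replacements above: bytes such as "\x93" and "\x96" also occur inside multi-byte UTF-8 sequences (an en dash in UTF-8 is E2 80 93), so a byte-by-byte str_replace can corrupt text that is already UTF-8. If the scraped page really is Windows-1252 (an assumption you would need to verify from its headers or meta tag), a rough alternative is to convert the whole string once with iconv. The sample input below is made up for illustration:

```php
<?php
// Hypothetical input: bytes in Windows-1252, where \x96 is an en dash
// and \x93 / \x94 are curly double quotes.
$bytes = "\x93hyphenated\x96words\x94";

// Convert the whole string to UTF-8 in one call instead of listing bytes by hand.
// iconv() returns false if it meets a byte that is undefined in the source charset.
$utf8 = iconv('CP1252', 'UTF-8', $bytes);

echo $utf8 === false ? 'conversion failed' : $utf8, PHP_EOL;
```

This only makes sense when the source is not already UTF-8; running it on UTF-8 input would double-encode the text.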
Upvotes: 0
Reputation: 16933
Documentation is your friend:
html_entity_decode(trim($title->plaintext), ENT_XHTML, YOUR_ENCODING);
                                            ^^^^^^^^^^^^^^^^^^^^^^^^
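To make the placeholder concrete, here is a minimal sketch that assumes you want UTF-8 output. Before PHP 5.4 the default charset was ISO-8859-1, which has no en dash, so passing the encoding explicitly is the likely fix. The sample string and the choice of ENT_QUOTES | ENT_HTML5 (available from PHP 5.4) are my assumptions, not something taken from the question:

```php
<?php
// Raw title as it might come out of the scraper (hypothetical sample value).
$raw = "  hyphenated&#8211;words &amp; more  ";

// Decode named and numeric entities, asking explicitly for UTF-8 output.
$title = html_entity_decode(trim($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $title, PHP_EOL; // prints "hyphenated–words & more"
```

Since html_entity_decode also handles &amp;, &lt;, &gt; and quotes, the extra htmlspecialchars_decode call from the question should not be needed with this approach.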
Upvotes: 0