Reputation: 43
I have been building a function that reads in the title text as found on a webpage between the <title></title>
tags. I am using the following regex code to grab the title text from the HTML page:
if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
$this->title = trim($m1[1]);
I am using the following to escape the value for the MySQL insert statement:
mysql_real_escape_string(rawurldecode($this->title))
So that leaves me with a database full of titles that have HTML entities (&nbsp; etc.) and
foreign characters, such as in:
Dating S.o.s | Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips
The goal is to decode, remove, and clean the titles so that they look as close to perfect English as possible.
I have constructed a function that uses the following two regexes to remove HTML entities and limit junk, respectively. While not ideal (because it removes the HTML entities rather than preserving them), it's the closest to clean that I've got.
$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);
But the non-English characters still exist.
Would anyone be able to offer help as to:
Thanks much for your help!
Upvotes: 1
Views: 983
Reputation: 11478
Check out http://www.php.net/manual/en/function.html-entity-decode.php for #1
And http://php.net/manual/en/function.mb-convert-encoding.php for #2
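A minimal sketch of the second suggestion, assuming the mbstring extension is enabled and that the scraped page was served as ISO-8859-1. Both the sample string and the source charset are assumptions for illustration; the real charset would have to be read from the page's Content-Type header or meta tag.

```php
<?php
// Hypothetical title bytes as they might arrive from an ISO-8859-1 page:
// the "é" is stored as the single byte 0xE9.
$latin1Title = "Caf\xE9 Recipes";

// Convert to UTF-8. The source charset ('ISO-8859-1') is an assumption;
// passing the wrong one produces garbled text rather than an error.
$utf8Title = mb_convert_encoding($latin1Title, 'UTF-8', 'ISO-8859-1');

echo $utf8Title, PHP_EOL;
```

Once the title is valid UTF-8, it can be stored as-is, provided the database connection and column also use UTF-8.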
Upvotes: 0
Reputation: 35790
For point 1, PHP has an html_entity_decode() function that you can use to turn HTML entities into "regular" characters.
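As a rough sketch of how that could replace the entity-stripping regex from the question: the sample title below is made up, and the choice to turn the decoded no-break space into a plain space is an assumption about what "clean" should mean here.

```php
<?php
// Hypothetical scraped title containing named and numeric entities.
$rawTitle = 'Caf&eacute; &amp; Bar&nbsp;| Tips &#38; Tricks';

// Decode entities into UTF-8 characters instead of deleting them.
$decoded = html_entity_decode($rawTitle, ENT_QUOTES, 'UTF-8');

// &nbsp; decodes to a no-break space (bytes C2 A0 in UTF-8);
// swap it for an ordinary space, then trim the ends.
$title = trim(str_replace("\xC2\xA0", ' ', $decoded));

echo $title, PHP_EOL;
```

Passing 'UTF-8' explicitly keeps the result independent of the default charset, which differs between PHP versions.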
Upvotes: 1