Reputation: 43

Cleaning text scraped from a webpage with PHP & regex

I have been building a function that reads in the title text found on a webpage between the <title></title> tags. I am using the following regex code to grab the title text from the HTML page:

 if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
      $this->title = trim($m1[1]);

I am using the following to escape the value for the MySQL INSERT statement:

mysql_real_escape_string(rawurldecode($this->title))

So that leaves me with a database full of titles that contain HTML entities (&nbsp; etc.) and stray foreign characters, for example: Dating S.o.sÂ |Â Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips
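
For context, here is a minimal sketch of one way a stray Â like the ones in the example title can arise. It assumes the scraped page was UTF-8, that the title contained non-breaking spaces, and that the bytes were treated as ISO-8859-1 somewhere along the way; none of that is confirmed for this particular page. It also assumes PHP 7+ with the mbstring extension, and the variable names are purely illustrative.

```php
<?php
// Assumption: the source title is UTF-8 and uses non-breaking spaces (U+00A0).
$nbspUtf8 = "\xC2\xA0"; // the two UTF-8 bytes that encode U+00A0
$title    = "Dating S.o.s" . $nbspUtf8 . "|" . $nbspUtf8 . "Gluten-free";

// Treating those bytes as ISO-8859-1 and re-encoding them as UTF-8 turns
// byte 0xC2 into the character "Â" and byte 0xA0 into a non-breaking space.
$garbled = mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1');

// Reversing the conversion recovers the original bytes.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($restored === $title);
```

If that assumption holds, the fix belongs in the character-set handling (page encoding, connection encoding, column encoding) rather than in a regex.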

The goal is to decode, remove, or otherwise clean the titles so that they look as close to proper English as possible.

I have constructed a function that uses the following two regexes to remove HTML entities and limit junk, respectively. While not ideal (because it removes the HTML entities rather than preserving them), it's the closest to clean I've gotten.

$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);

But the non-English characters still exist.
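
To make the "preserve rather than remove" idea concrete, here is a rough sketch of decoding the entities instead of stripping them. It assumes PHP 7+ and that the input string is already valid UTF-8; `clean_title` is a hypothetical helper name, not something from my existing code.

```php
<?php
// Hypothetical helper: decode entities instead of deleting them.
// Assumes $raw is valid UTF-8.
function clean_title(string $raw): string
{
    // Turn entities such as &amp;, &nbsp; or &#39; into real UTF-8 characters.
    $text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Replace non-breaking spaces (U+00A0) with ordinary spaces.
    $text = str_replace("\u{00A0}", ' ', $text);

    // Collapse runs of whitespace; the /u flag makes the pattern UTF-8 aware.
    // preg_replace returns null on invalid UTF-8, so fall back to $text.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;

    return trim($text);
}

echo clean_title('Tom &amp; Jerry&nbsp;|&nbsp;Caf&eacute; &#39;Menu&#39;'), "\n";
// prints: Tom & Jerry | Café 'Menu'
```

This keeps apostrophes, ampersands and accented letters intact, but it does nothing for text that was already mis-decoded before it reached this step.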

Would anyone be able to offer help as to:

What is the best way to save these title strings to the DB while preserving the English intent (punctuation, apostrophes, etc.)?
How to convert or eliminate the strange chars as shown in my example title above?

Thanks much for your help!

Upvotes: 1