How can I remove junk characters with regex?

Question

I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using regex to split the contents into single sentences and then parsing them.

I would like to remove characters like Â from my sentences. These characters, I imagine, are because of the HTML encoding.

I obviously cannot use a regex like [^\w\d]+ or its variations because I need the punctuation intact. Of course, I could add individual exceptions for each punctuation mark, like [^\w\d\.,:]+ and so on, but I would prefer an easier way to do this, such as a character class that knows it is a... funny character?

Any help will be much appreciated. Thanks.

EDIT: The app is built with PHP and I am using a simple file_get_contents() to fetch the HTML data from the site and reading the contents inside

tags.

Matthew Green · Accepted Answer

This was mentioned in the comments by @TheGreatCO, but you can create a character class of "special" characters by using hex code values to define a range. To match any special character above ASCII 127, you could use this:

[\x80-\xFE]

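That matches bytes 0x80 through 0xFE, that is, nearly everything outside the basic 7-bit ASCII set (use \xFF as the upper bound if you want to cover every high byte). For reference, here's a list of the ASCII character table with their hex codes.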

This page discusses the different ways you can reference special characters in regex.
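Since the app is built with PHP, here is a minimal sketch of how the character class above might be applied with preg_replace(). The function name stripHighBytes and the sample string are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): removes every byte in
// the 0x80-0xFE range, leaving 7-bit ASCII letters, digits and punctuation.
function stripHighBytes(string $text): string
{
    // No /u modifier, so PCRE works byte by byte rather than on code points.
    $clean = preg_replace('/[\x80-\xFE]+/', '', $text);

    // preg_replace() returns null on failure; fall back to the input then.
    return $clean ?? $text;
}

// "\xC3\x82" is the UTF-8 byte sequence for "Â"; it is written as escapes so
// the example does not depend on this file's own encoding.
$sample = "Hello,\xC3\x82 world: it works.";
echo stripHighBytes($sample), PHP_EOL; // prints "Hello, world: it works."
```

Note that this removes every non-ASCII byte, so legitimate accented letters such as é in UTF-8 text are stripped as well. If the junk comes from the page being decoded with the wrong character set, fixing the encoding before parsing is probably the cleaner route.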
