Conversation Company

Reputation: 285

How can I remove junk characters with regex?

I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I use regex to split the contents into single sentences and then parse them.

I would like to remove characters like  from my sentences. These characters, I imagine, are because of the HTML encoding.

I obviously cannot use a regex like [^\w\d]+ or its variations because I need the punctuation intact. Of course, I could add an exception for each punctuation mark, as in [^\w\d\.,:]+ and so on, but I would prefer an easier way to do this, such as a character class that knows a character is... funny?

Any help will be much appreciated. Thanks.

EDIT: The app is built with PHP. I use a simple file_get_contents() to fetch the HTML data from the site and read the contents inside <p> tags.

Upvotes: 0

Views: 2390

Answers (2)

kavya

Reputation: 11

I found this regex helpful for identifying junk characters in a file using the Atom editor:

[^(\x20-\x7F\p{Sc})]
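Since the question mentions PHP, here is a minimal sketch of how a pattern like this might be applied with preg_replace(). The function name strip_junk is hypothetical, and the pattern is a slightly adjusted variant rather than the exact one above: the parentheses are dropped (inside a character class they are literal and already fall within the range) and the upper bound is \x7E so that the DEL control character is removed too.

```php
<?php
// Hypothetical helper: drops every character that is neither printable
// ASCII (\x20-\x7E) nor a Unicode currency symbol (\p{Sc}).
function strip_junk(string $text): string
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8,
    // so the class works on whole code points rather than single bytes.
    $clean = preg_replace('/[^\x20-\x7E\p{Sc}]+/u', '', $text);

    // preg_replace() returns null on error (for example, invalid UTF-8
    // input), so fall back to the untouched text in that case.
    return $clean ?? $text;
}

// U+FFFD (replacement character) and U+00A0 (no-break space) are removed;
// the euro sign and the ASCII punctuation survive.
echo strip_junk("Price: \u{20AC}5 \u{FFFD}\u{00A0}ok."), "\n";
```

Note that this also strips tabs, newlines, and legitimate non-ASCII letters such as é, so it probably only suits text that is expected to be plain English.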

Upvotes: 1

Matthew Green

Reputation: 10401

This was mentioned in the comments by @TheGreatCO, but you can create a character class of "special" characters by using hex code values to define a range. To match any special character above ASCII 127, the class would be this:

[\x80-\xFE]

That would match anything but your most basic characters. For reference's sake, here's the ASCII character table with hex codes.

This page discusses the different ways you can reference special characters in regex.
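As a rough illustration in PHP, the range could be used like this. The function name strip_non_ascii is made up for the example, and the sketch assumes the input is UTF-8 (or a single-byte encoding). It also extends the upper bound from \xFE to \xFF so that byte 255 is covered as well.

```php
<?php
// Hypothetical helper: removes everything outside the 7-bit ASCII range.
function strip_non_ascii(string $text): string
{
    // Without the /u modifier PCRE matches byte by byte. Every byte of a
    // multi-byte UTF-8 sequence is >= 0x80, so this class removes each
    // non-ASCII character completely and leaves ASCII punctuation alone.
    $clean = preg_replace('/[\x80-\xFF]+/', '', $text);

    // preg_replace() returns null on error; keep the original text then.
    return $clean ?? $text;
}

// U+2019 (curly apostrophe) and U+00C2 (a typical mis-decoding leftover)
// are dropped; the comma and exclamation mark stay.
echo strip_non_ascii("It\u{2019}s \u{00C2}fine, really!"), "\n";
```

One design point worth knowing: with the /u modifier the same escapes would refer to code points U+0080 to U+00FF rather than bytes, so characters such as U+2019 would no longer match. In that mode, a negated class like [^\x00-\x7F] is probably the closer equivalent.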

Upvotes: 1
