Reputation: 285
I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using regex to split the contents into single sentences and then parsing them.
I would like to remove characters like Â
from my sentences. These characters, I imagine, are because of the HTML encoding.
I obviously cannot use a regex like [^\w\d]+
or its variations because I need the punctuation intact. Of course I could add individual exceptions for each punctuation mark, like [^\w\d\.,:]+
and so on, but I would like it if there is an easier way to do this, like probably a character class that knows it is a... funny character?
Any help will be much appreciated. Thanks.
EDIT: The app is built with PHP and I am using a simple file_get_contents()
to fetch the HTML data from the site and reading the contents inside <p>
tags.
Upvotes: 0
Views: 2390
Reputation: 11
I found this regex helpful for identifying junk characters in a file using Atom:
[^(\x20-\x7F\p{Sc})]
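Since the question is about PHP, here is a minimal sketch of how a pattern like this might be applied with `preg_match_all()`. The function name `find_junk_characters` is made up for illustration, and two things are assumptions not in the answer above: the `/u` flag (so that `\p{Sc}` works on UTF-8 input) and dropping the parentheses, which are literal inside a character class and already fall within `\x20-\x7F`.

```php
<?php
// Hypothetical helper: lists every character that is neither printable
// ASCII (\x20-\x7F) nor a currency symbol (\p{Sc}).
function find_junk_characters(string $text): array
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8.
    // preg_match_all() returns false if the subject is not valid UTF-8.
    if (preg_match_all('/[^\x20-\x7F\p{Sc}]/u', $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

// "\u{00C2}" is the stray character from the question; the euro sign is
// a currency symbol, so it is not reported.
$sample = "Price: 5\u{20AC} for the caf\u{00C2} special.";
print_r(find_junk_characters($sample));
```

Note that tabs and newlines sit below `\x20`, so this class reports them as junk too; add `\t\r\n` to the class if they should be allowed.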
Upvotes: 1
Reputation: 10401
This was mentioned in the comments by @TheGreatCO: you can create a character class of "special" characters by using hex code values to define a range. For any character above ASCII 127, that would be:
[\x80-\xFE]
That would match anything but your most basic characters. For reference, here's the ASCII character table with hex codes.
This page discusses the different ways you can reference special characters in regex.
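As a rough sketch of how that range could be used from PHP, the snippet below strips every byte above ASCII 127 while leaving punctuation untouched. The function name `strip_non_ascii` is hypothetical, and the range is widened to `\xFF` here as an assumption so that the last byte value is covered as well.

```php
<?php
// Hypothetical helper: removes every byte above ASCII 127 and keeps
// letters, digits, whitespace and punctuation as they are.
function strip_non_ascii(string $text): string
{
    // No /u flag, so the class matches raw bytes. Every byte of a
    // multibyte UTF-8 character is >= 0x80, so such characters are
    // removed whole. Fall back to the input if preg_replace() fails.
    return preg_replace('/[\x80-\xFF]+/', '', $text) ?? $text;
}

$sentence = "Hello,\u{00C2} world: it works.";
echo strip_non_ascii($sentence), "\n"; // prints "Hello, world: it works."
```

Keep in mind that this also removes legitimate non-ASCII characters such as accented letters, so it only suits text that is expected to be plain ASCII.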
Upvotes: 1