daniel451

Reputation: 11002

RegEx for matching all non-words except punctuation?

For sentences like:

sent = ("This i$s a s[[]ample sentence.\nAnd another <<one>>."
        "\nMoreover, it is 'filtered'!")

I would like to get:

"This is a sample sentence. And another one. Moreover, it is filtered."

Thus, I thought using re.sub should be the way to go. However, RegEx doesn't work as expected (like it pretty much always does^^).

My idea was to use \W to match every non-word and then exclude [.,;!?] to keep the punctuation. The last RegEx I've tried was:

re.sub(r"(\W[^\.\,\;\?\!])", "", sent)

Unfortunately, [^\.\,\;\?\!] matches anything that is not an entry of [.,;!?], instead of simply saying 'do not match these characters literally'.

How can I exclude these characters from match?

Upvotes: 1

Views: 3186

Answers (1)

Sebastian Proske

Reputation: 8413

The \W needs to be integrated into the negated character class. \W is the same as [^\w], so you'll end up with [^\w.,;!?]. You should repeat this character class to match contiguous occurrences in a single step: [^\w.,;!?]+.

It seems you also want to keep spaces, so you should add them to your character class.
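
A small sketch of the difference, using a made-up test string (`text`, `without_space` and `with_space` are only illustrative names, not from the question):

```python
import re

# Hypothetical sample input, not taken from the question.
text = "a s[[]ample, <<one>>!"

# Without a space in the class, spaces are stripped along with the junk.
without_space = re.sub(r"[^\w.,;!?]+", "", text)

# With a space in the class, the gaps between words survive.
with_space = re.sub(r"[^\w.,;!? ]+", "", text)

print(without_space)  # asample,one!
print(with_space)     # a sample, one!
```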

Reading deeper into your question, you also want to replace newlines with a space and ! with a period. This makes it a multi-step solution. First filter out anything unwanted with [^\w.,;!? \n]+, then in a next step replace \n with a space and ! with a period.
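
The two steps could be sketched roughly like this. The function name `clean_sentence` is only an assumption for illustration, and plain `str.replace` is used for the second step, though a second `re.sub` would work as well:

```python
import re

def clean_sentence(sent):
    # Step 1: drop everything except word characters, the kept
    # punctuation, spaces and newlines.
    filtered = re.sub(r"[^\w.,;!? \n]+", "", sent)
    # Step 2: newlines become spaces, exclamation marks become periods.
    return filtered.replace("\n", " ").replace("!", ".")

sent = ("This i$s a s[[]ample sentence.\nAnd another <<one>>."
        "\nMoreover, it is 'filtered'!")
print(clean_sentence(sent))
# This is a sample sentence. And another one. Moreover, it is filtered.
```

Keep in mind that \w also matches digits and the underscore, so those are kept too.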

Upvotes: 2
