Reputation: 11002
For sentences like:
sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"
I would like to get:
"This is a sample sentence. And another one. Moreover, it is filtered."
Thus, I thought using re.sub should be the way to go. However, RegEx doesn't work as expected (like it pretty much always does^^).
My idea was to use \W to match every non-word character and then exclude [.,;!?] to keep the punctuation. The last RegEx I've tried was:
re.sub(r"(\W[^\.\,\;\?\!])", "", sent)
Unfortunately, [^\.\,\;\?\!] matches any character that is not one of [.,;!?], instead of simply saying 'do not match these characters literally'.
How can I exclude these characters from the match?
Upvotes: 1
Views: 3186
Reputation: 8413
The \W needs to be integrated into the negated character class. \W is the same as [^\w], so you'll end up with [^\w.,;!?]. You should repeat this character class to match contiguous occurrences in a single step: [^\w.,;!?]+.
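As a quick check of that character class, here is a minimal sketch using the sample string from the question (the variable name stripped is just for illustration):

```python
import re

sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"

# \W folded into the negated class: remove every run of characters that are
# neither word characters nor one of . , ; ! ?
stripped = re.sub(r"[^\w.,;!?]+", "", sent)
print(stripped)
```

Note that this also removes the spaces and newlines, since they are not word characters either.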
It seems you also want to keep spaces, so you should add them to your character class.
Reading deeper into your question, you also want to replace newlines with a space and ! with a period. This makes it a multi-step solution: first filter out anything unwanted with [^\w.,;!? \n]+, then replace \n with a space and ! with a period (.).
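Put together, the steps could look roughly like this. It is only a sketch; the variable name cleaned is made up, and using str.replace for the second step is my choice here (a second re.sub would work as well):

```python
import re

sent = "This i$s a s[[]ample sentence.\nAnd another <<one>>.\nMoreover, it is 'filtered'!"

# Step 1: drop everything except word characters, the punctuation to keep,
# spaces and newlines.
cleaned = re.sub(r"[^\w.,;!? \n]+", "", sent)

# Step 2: turn newlines into spaces and exclamation marks into periods.
cleaned = cleaned.replace("\n", " ").replace("!", ".")

print(cleaned)
# This is a sample sentence. And another one. Moreover, it is filtered.
```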
Upvotes: 2