Stefano Potter

Reputation: 3577

Regex: Remove all special characters that are not apostrophes between letters

I have a string like so:

s = "i'm sorry, sir, but this is a 'gluten-free' restaurant. we don't serve bread."

and I am trying to use re.sub to replace every special character with a space, except apostrophes that sit between two letters, so 'gluten-free' becomes gluten free and i'm stays as i'm.

I have tried this:

import re

s = re.sub('[^[a-z]+\'?[a-z]+]', ' ', s)

My intent was to replace, with a space, anything that does not follow the pattern of one or more letters, then zero or one apostrophe, then one or more letters.

This returns the same string:

i'm sorry, sir, but this is a 'gluten-free' restaurant. we don't serve bread.

I would like to have:

i'm sorry  sir  but this is a  gluten free  restaurant  we don't serve bread 

Upvotes: 1

Views: 1015

Answers (2)

Wiktor Stribiżew

Reputation: 626802

You can use

import re
s = "i'm sorry, sir, but this is a 'gluten-free' restaurant. we don't serve bread."
print( re.sub(r"(?:(?!\b['‘’]\b)[\W_])+", ' ', s).strip() )
# => i'm sorry sir but this is a gluten free restaurant we don't serve bread

See the Python demo and the regex demo.

Details:

  • (?: - start of a non-capturing group:
    • (?!\b['‘’]\b) - a negative lookahead that fails the match if the next character is an apostrophe (straight or curly) with a word character on each side
    • [\W_] - any non-word character or an underscore
  • )+ - one or more occurrences
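To see how the pattern behaves on a few edge cases, here is a minimal sketch. The helper name clean and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r"(?:(?!\b['‘’]\b)[\W_])+")

def clean(text):
    """Replace each run of special characters with one space,
    keeping apostrophes that have a word character on both sides."""
    return pattern.sub(" ", text).strip()

# Hypothetical sample inputs chosen to exercise each part of the pattern.
samples = ["don't", "'quoted'", "snake_case", "rock'n'roll", "a -- b"]
for sample in samples:
    print(repr(sample), "->", repr(clean(sample)))
# "don't"       -> "don't"        (apostrophe between letters is kept)
# "'quoted'"    -> 'quoted'       (outer quotes have no letter on one side)
# 'snake_case'  -> 'snake case'   (underscore is matched by [\W_])
# "rock'n'roll" -> "rock'n'roll"
# 'a -- b'      -> 'a b'          (the + collapses the whole run into one space)
```

Because of the + quantifier, consecutive special characters become a single space, which is why this answer's output has no double spaces.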

Upvotes: 0

anubhava

Reputation: 785128

You may use this regex with a nested lookahead+lookbehind:

>>> import re
>>> s = "i'm sorry, sir, but this is a 'gluten-free' restaurant. we don't serve bread."
>>> print ( re.sub(r"(?!(?<=[a-z])'[a-z])[^\w\s]", ' ', s, flags=re.I) )
i'm sorry  sir  but this is a  gluten free  restaurant  we don't serve bread

RegEx Demo

RegEx Details:

  • (?!: Start negative lookahead
    • (?<=[a-z]): Positive lookbehind to assert that the previous character is a letter
    • ': Match an apostrophe
    • [a-z]: Match a letter (upper or lower case, because of re.I)
  • ): End negative lookahead
  • [^\w\s]: Match a character that is not a whitespace and not a word character
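As a rough sketch of what this pattern does in practice, the snippet below wraps it in a helper. The name clean, the collapse parameter, and the extra sample string are assumptions added for illustration, not part of the answer.

```python
import re

# The pattern from this answer; re.I lets [a-z] match upper-case letters too.
pattern = re.compile(r"(?!(?<=[a-z])'[a-z])[^\w\s]", flags=re.I)

def clean(text, collapse=False):
    """Replace each special character with one space.
    With collapse=True, also squeeze repeated whitespace (an optional extra)."""
    result = pattern.sub(" ", text)
    # str.split() with no arguments drops leading, trailing and repeated whitespace.
    return " ".join(result.split()) if collapse else result

s = "i'm sorry, sir, but this is a 'gluten-free' restaurant. we don't serve bread."
print(repr(clean(s)))
# "i'm sorry  sir  but this is a  gluten free  restaurant  we don't serve bread "
print(repr(clean(s, collapse=True)))
# "i'm sorry sir but this is a gluten free restaurant we don't serve bread"
print(repr(clean("snake_case 5'9")))
# 'snake_case 5 9'
```

Two things follow from the pattern itself: each special character is replaced one for one, so the original spacing (including the trailing space) is preserved, and underscores survive because \w includes them. An apostrophe between digits is replaced, since the lookbehind and the following class only accept letters.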

Upvotes: 2
