nhandy

Reputation: 83

Regular Expressions (regex) Remove the word "and", Non Alphanumeric Characters and White Spaces from a string in Python

In Python, I'm attempting to clean (and later compare) artist names, and I want to remove any of the following:

  1. non alpha characters, or
  2. white spaces, or
  3. the word "and"

INPUT STRING: Bootsy Collins and The Rubber Band

DESIRED OUTPUT: BootsyCollinsTheRubberBand

import re

s = 'Bootsy Collins and The Rubber Band'
res1 = re.sub(r'[^\w]|\s|\s+(and)\s', "", s)
res2 = re.sub(r'[^\w]|\s|\sand\s', "", s)
res3 = re.sub(r'[^\w]|\s|(and)', "", s)

print("\b", s, "\n"
      , "1st: ", res1, "\n"
      , "2nd: ", res2, "\n"
      , "3rd: ", res3)
Output:
Bootsy Collins and The Rubber Band 
 1st:  BootsyCollinsandTheRubberBand 
 2nd:  BootsyCollinsandTheRubberBand 
 3rd:  BootsyCollinsTheRubberB

Upvotes: 3

Views: 153

Answers (2)

SeaBean

Reputation: 23217

To support the rules you set out in general, rather than only the sample text quoted, you need a more general regex, with the right flags passed to re.sub:

re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)

Explanation

  • The re.IGNORECASE flag is set so that "And" and other upper/lowercase variations are also removed. If you want to remove only the lowercase "and", drop the flag.
  • \band\b is the word "and" enclosed by the word-boundary token \b on both sides, so the three-character sequence "and" matches only as an independent word, not as a substring of another word (such as "Band"). Using \b rather than surrounding whitespace, as in \s+and\s, has the advantage that \b also detects the boundary in text like "and," whereas \s+and\s cannot, because a comma is not whitespace.
  • Whitespace (\s) is itself a non-word character (\W): \w matches letters, digits and the underscore ([a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits by default), so \W already covers \s and no separate \s alternative is needed.

Demo

Test case #1:

s = 'Bootsy Collins and The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Test case #2 ('And' got removed) :

s = 'Bootsy Collins And The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Test case #3 ('and,' with a comma after 'and': the word is still removed):

s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Counter-test case (regex using whitespace, \s+ and \s, instead of \b to delimit the word):

s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\s+(and)\s|\W', '', s)
print(res)

Output ('and' is NOT removed):
BootsyCollinsandTheRubberBand
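
Since the question mentions comparing the cleaned names later, the pattern above could be compiled once and wrapped in a small helper. This is only a sketch: the names NAME_NOISE, normalize_artist and same_artist are hypothetical, and the use of casefold() assumes the comparison should ignore case, which the question does not state.

```python
import re

# Hypothetical name: the pattern from this answer, compiled once for reuse.
NAME_NOISE = re.compile(r'\band\b|\W', flags=re.IGNORECASE)

def normalize_artist(name):
    """Remove the standalone word "and" and every non-word character."""
    return NAME_NOISE.sub('', name)

def same_artist(a, b):
    """Compare two names after cleaning; ignoring case is an assumption."""
    return normalize_artist(a).casefold() == normalize_artist(b).casefold()

print(normalize_artist('Bootsy Collins and The Rubber Band'))
print(same_artist('Bootsy Collins and The Rubber Band',
                  'bootsy collins & the rubber band'))
```

Note that \W leaves digits and underscores in place, so an underscore in a name would survive this cleaning.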

Upvotes: 3

trincot

Reputation: 350290

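Your first two regular expressions don't match " and " because alternatives are tried from left to right: when the engine reaches the space before "and", an earlier single-character alternative ([^\w], which already covers \s) matches that space, so the final alternative (\s+(and)\s or \sand\s) is never tried at that position.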

You simply need to change the order, so that the latter is tried first. Also, \s is part of [^\w], so you don't need to match \s separately. And finally, \W is the shorter form of [^\w]. So use:

\s+(and)\s|\W 
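
A minimal sketch of why the order matters, using the sample string from the question (the variable names wrong_order and right_order are only for illustration):

```python
import re

s = 'Bootsy Collins and The Rubber Band'

# Single-character alternative first: it consumes the space before "and",
# so the longer alternative never gets a chance to match there.
wrong_order = re.sub(r'\W|\s+and\s', '', s)

# Longer alternative first: " and " is matched as a whole and removed.
right_order = re.sub(r'\s+and\s|\W', '', s)

print(wrong_order)  # BootsyCollinsandTheRubberBand
print(right_order)  # BootsyCollinsTheRubberBand
```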

Upvotes: 3
