Regular Expressions (regex) Remove the word "and", Non Alphanumeric Characters and White Spaces from a string in Python

Question

In Python, I'm attempting to clean (and later compare) artist names and want to remove:

non-alphanumeric characters, or
whitespace, or
the word "and"

INPUT STRING: Bootsy Collins and The Rubber Band

DESIRED OUTPUT: BootsyCollinsTheRubberBand

import re

s = 'Bootsy Collins and The Rubber Band'
res1 = re.sub(r'[^\w]|\s|\s+(and)\s', "", s)
res2 = re.sub(r'[^\w]|\s|\sand\s', "", s)
res3 = re.sub(r'[^\w]|\s|(and)', "", s)

# Print the input and each attempt on its own line
print(s)
print("1st:", res1)
print("2nd:", res2)
print("3rd:", res3)

Output:
Bootsy Collins and The Rubber Band
1st: BootsyCollinsandTheRubberBand
2nd: BootsyCollinsandTheRubberBand
3rd: BootsyCollinsTheRubberB

SeaBean · Accepted Answer

To support the rules you set out, rather than just the sample text quoted, you need a more general regex and the right flags argument in the re.sub call:

re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
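
For context, the attempts in the question leave "and" in place because Python's re module tries the alternatives left to right at each position: [^\w] matches the space before "and" first, so the later \s+(and)\s branch never gets the leading whitespace it needs. A minimal sketch of that ordering effect (the variable names are illustrative only):

```python
import re

s = 'Bootsy Collins and The Rubber Band'

# General class first: the space before "and" is consumed by \W,
# so the "and" branch can no longer match with its leading whitespace.
class_first = re.sub(r'\W|\s+and\s', '', s)

# Specific branch first: " and " is matched as a unit and removed.
and_first = re.sub(r'\s+and\s|\W', '', s)

print(class_first)  # BootsyCollinsandTheRubberBand
print(and_first)    # BootsyCollinsTheRubberBand
```

Reordering alone still depends on whitespace surrounding the word, which is why the pattern above uses \b instead.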

Explanation

The flag re.IGNORECASE is set so that "And" (and any other uppercase/lowercase variation) is removed as well. If you want to remove only lowercase "and", drop this flag.
\band\b is the word "and" enclosed by the word-boundary token \b on both sides. It matches the three-character sequence "and" only as an independent word, not as a substring of another word. Using \b rather than surrounding the word with whitespace, as in \s+and\s, has the advantage that \b also detects the word boundary in strings like "and,", which \s+and\s cannot, because a comma is not whitespace.
Whitespace \s is also a kind of non-word character \W (a word character \w is [a-zA-Z0-9_] for ASCII text; for str patterns in Python 3 it also covers Unicode letters and digits), so you don't need separate tokens for both: \W already includes \s. Note that the underscore counts as a word character, so \W leaves it in place.
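
The points about \b and \W can be checked in isolation. The following is a small sketch on made-up strings, not taken from the original answer:

```python
import re

text = 'Band and Sand'

# Without \b, "and" is also stripped from inside "Band" and "Sand".
substring_result = re.sub(r'and', '', text)

# With \b on both sides, only the standalone word is removed.
whole_word_result = re.sub(r'\band\b', '', text)

# \W matches whitespace as well as punctuation, so \s is redundant beside it.
non_word_chars = re.findall(r'\W', 'a b,c')

print(repr(substring_result))   # 'B  S'
print(repr(whole_word_result))  # 'Band  Sand'
print(non_word_chars)           # [' ', ',']
```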

Demo

Test case #1:

s = 'Bootsy Collins and The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Test case #2 ('And' is removed):

s = 'Bootsy Collins And The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Test case #3 ('and,' with a comma after 'and' is removed):

s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)

Output:
BootsyCollinsTheRubberBand

Counter test case (regex using whitespace \s+ or \s instead of \b for the word boundary):

s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\s+(and)\s|\W', '', s)
print(res)

Output ('and' is NOT removed):
BootsyCollinsandTheRubberBand
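
Since the question mentions comparing names later, the pattern can be wrapped in a small helper. This is only a sketch: normalize_artist and same_artist are hypothetical names, and both the [\W_] class (to strip underscores too, which \W alone leaves in place) and casefold() (for a case-insensitive comparison) are assumptions that go beyond what the answer states.

```python
import re

# Hypothetical helper, not part of the original answer.
# [\W_] is an assumption: it also strips "_", which \W alone would keep.
_CLEAN = re.compile(r'\band\b|[\W_]', flags=re.IGNORECASE)


def normalize_artist(name: str) -> str:
    """Drop the standalone word "and" and every non-alphanumeric character."""
    return _CLEAN.sub('', name)


def same_artist(a: str, b: str) -> bool:
    """Compare two names after cleaning; casefold() ignores case (an assumption)."""
    return normalize_artist(a).casefold() == normalize_artist(b).casefold()


print(normalize_artist('Bootsy Collins and The Rubber Band'))
print(same_artist('Bootsy Collins & The Rubber Band',
                  'bootsy collins and the rubber band'))
```

Compiling the pattern once keeps the helper cheap to call across a long list of names. If case differences should count as different artists, drop the casefold() calls.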
