Reputation: 83
In Python, I'm attempting to clean (and later compare) artists' names and want to remove:
INPUT STRING: Bootsy Collins and The Rubber Band
DESIRED OUTPUT: BootsyCollinsTheRubberBand
import re
s = 'Bootsy Collins and The Rubber Band'
res1 = re.sub(r'[^\w]|\s|\s+(and)\s', "", s)
res2 = re.sub(r'[^\w]|\s|\sand\s', "", s)
res3 = re.sub(r'[^\w]|\s|(and)', "", s)
print("\b", s, "\n"
, "1st: ", res1, "\n"
, "2nd: ", res2, "\n"
, "3rd: ", res3)
Output:
Bootsy Collins and The Rubber Band
1st: BootsyCollinsandTheRubberBand
2nd: BootsyCollinsandTheRubberBand
3rd: BootsyCollinsTheRubberB
Upvotes: 3
Views: 153
Reputation: 23217
To support the rules you set out, rather than just the sample text quoted, you need a more general regex and the right flags setting in the re.sub call:
re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
re.IGNORECASE is set so that "And" (and any other upper/lowercase variation) is removed as well. If you want to remove only the lowercase "and", drop this flag.
\band\b is the word "and" enclosed by the word-boundary token \b on both sides. It matches the three-character sequence "and" only as an independent word, not as a substring of another word (such as "Band"). Using \b to isolate the word, instead of surrounding it with whitespace as in \s+and\s, has the advantage that \b also detects the word boundary in strings like "and," while \s+and\s cannot, because a comma is not whitespace.
\s is already covered by the non-word class \W (the word class \w is equivalent to [a-zA-Z0-9_] for ASCII text; for Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set), so you don't need separate tokens for both and can drop \s from the regex.
Test case #1:
s = 'Bootsy Collins and The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
BootsyCollinsTheRubberBand
Test case #2 ('And' is removed):
s = 'Bootsy Collins And The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
BootsyCollinsTheRubberBand
Test case #3 ('and,' with a comma after 'and' is removed):
s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\band\b|\W', '', s, flags=re.IGNORECASE)
print(res)
Output:
BootsyCollinsTheRubberBand
Counter test case (regex using whitespace \s+ or \s instead of \b for the word boundary):
s = 'Bootsy Collins and, The Rubber Band'
res = re.sub(r'\s+(and)\s|\W', '', s)
print(res)
Output ('and' is NOT removed):
BootsyCollinsandTheRubberBand
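Since the goal in the question is to compare names later, the test cases above could be bundled into a small helper. This is only a sketch: the name normalize_artist and the precompiled _CLEAN pattern are hypothetical, not something from the question.

```python
import re

# Hypothetical helper (not from the original posts): removes the standalone
# word "and" in any casing, plus every non-word character.
_CLEAN = re.compile(r'\band\b|\W', flags=re.IGNORECASE)

def normalize_artist(name):
    """Return the name with 'and' and all non-word characters stripped."""
    return _CLEAN.sub('', name)

variants = [
    'Bootsy Collins and The Rubber Band',
    'Bootsy Collins And The Rubber Band',
    'Bootsy Collins and, The Rubber Band',
]

# All three spellings collapse to the same key, so they compare equal.
normalized = {normalize_artist(v) for v in variants}
print(normalized)
```

Note that \W leaves underscores in place, because "_" counts as a word character.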
Upvotes: 3
Reputation: 350290
Your first two regular expressions don't remove " and " because, when the regex engine reaches the space before "and", an earlier alternative ([^\w], or failing that \s) matches that single space on its own, so the \s+(and)\s part of your regex is never tried at that position.
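One way to see this is to list what the first pattern from the question actually matches. The snippet below is only an illustration; the variable names are mine.

```python
import re

s = 'Bootsy Collins and The Rubber Band'
pattern = r'[^\w]|\s|\s+(and)\s'  # first pattern from the question

# Collect (matched text, captured group) for every match in the string.
matches = [(m.group(0), m.group(1)) for m in re.finditer(pattern, s)]
print(matches)

# Every match is a single space and the (and) group never participates,
# which shows the third alternative is never the one that succeeds.
print(all(text == ' ' and group is None for text, group in matches))
```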
You simply need to change the order so that the latter is tried first. Also, \s is already covered by [^\w], so you don't need to match \s separately. And finally, \W is the shorter form of [^\w]. So use:
\s+(and)\s|\W
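Applied to the string from the question, the reordered pattern could be checked like this. It is a quick sketch; the second input is an extra example of mine showing that, without re.IGNORECASE, a capitalized "And" is kept.

```python
import re

pattern = r'\s+(and)\s|\W'

s = 'Bootsy Collins and The Rubber Band'
res = re.sub(pattern, '', s)
print(res)  # BootsyCollinsTheRubberBand

# Extra example (not from the question): matching is case-sensitive here,
# so "And" survives and only the surrounding spaces are removed.
res_upper = re.sub(pattern, '', 'Bootsy Collins And The Rubber Band')
print(res_upper)  # BootsyCollinsAndTheRubberBand
```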
Upvotes: 3