Reputation: 664
I would like to remove all acronyms, even those that are written inconsistently. For instance, in the list below (text), some of the acronyms are missing an opening or a closing bracket, so I would like to remove those too. So far I am only able to remove the ones that have both brackets.
How can I adapt my current regular expression so that it does not only match upper case chars surrounded by two brackets?
import re
text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']
for string in text:
    cleaned_acronyms = re.sub(r'\([A-Z]*\)', '', string)  # remove uppercase chars with ( )
    print(cleaned_acronyms)
Current output:
>>> Spain
>>> Netherlands (NL .
>>> United States (USA.
>>> Russia RU)
Desired output:
>>> Spain
>>> Netherlands
>>> United States
>>> Russia
Upvotes: 1
Views: 188
Reputation: 163517
You could match two or more uppercase chars that have a parenthesis on at least one side (an opening one before them or a closing one after them), together with any preceding whitespace and the rest of the line.
\s*(?:\([A-Z]{2,}|[A-Z]{2,}\)).*
For example:
import re
text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']
for string in text:
    cleaned_acronyms = re.sub(r'\s*(?:\([A-Z]{2,}|[A-Z]{2,}\)).*', '', string)
    print(cleaned_acronyms)
Output:
Spain
Netherlands
United States
Russia
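To make the parts of the pattern easier to follow, here is a minimal sketch of the same expression written with re.VERBOSE and wrapped in a small helper. The names ACRONYM_TAIL and strip_acronym are only illustrative and are not part of the original code.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
# ACRONYM_TAIL and strip_acronym are hypothetical names chosen for this sketch.
ACRONYM_TAIL = re.compile(
    r"""
    \s*                 # optional whitespace before the acronym
    (?:
        \( [A-Z]{2,}    # an opening parenthesis followed by 2+ uppercase chars
      |
        [A-Z]{2,} \)    # or 2+ uppercase chars followed by a closing parenthesis
    )
    .*                  # the rest of the line after the acronym
    """,
    re.VERBOSE,
)


def strip_acronym(string):
    """Remove a (possibly half-bracketed) acronym and everything after it."""
    return ACRONYM_TAIL.sub('', string)


text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']
for string in text:
    print(strip_acronym(string))
```

Note that because of the trailing .*, anything that follows the acronym on the same line is removed as well, so this fits best when the acronym is the last thing in the string.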
Upvotes: 2