blah

Reputation: 664

Remove inconsistent acronyms in strings using regex

I would like to remove all acronyms, even those that are written inconsistently. For instance, in the list below (text), some of the acronyms are missing an opening or a closing bracket, and I would like to remove those too. So far I am only able to remove those that have both brackets.

How can I adapt my current regular expression so that it does not only match uppercase characters surrounded by two brackets?

import re

text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']  

for string in text:
    cleaned_acronyms = re.sub(r'\([A-Z]*\)', '', string)  # remove uppercase chars enclosed in ( )
    print(cleaned_acronyms)

Current output:
>>> Spain 
>>> Netherlands (NL .
>>> United States (USA.
>>> Russia RU)

Desired output:

>>> Spain
>>> Netherlands
>>> United States
>>> Russia

Upvotes: 1

Views: 188

Answers (2)

Jan

Reputation: 43179

You might get along with

 \(?\b[A-Z.]{2,3}\b.+

See a demo on regex101.com.
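
As a minimal sketch of how this pattern could be applied to the list from the question: the helper name strip_acronym_tail and the trailing rstrip() are assumptions added here, not part of the answer. The rstrip() is needed because the pattern does not consume the space in front of the acronym.

```python
import re

# Pattern from the answer: an optional "(", then 2-3 uppercase letters or dots
# delimited by word boundaries, then at least one more character.
ACRONYM_TAIL = re.compile(r'\(?\b[A-Z.]{2,3}\b.+')

def strip_acronym_tail(s):
    # Hypothetical helper: drop the matched tail, then trim the leftover space.
    return ACRONYM_TAIL.sub('', s).rstrip()

text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']

cleaned = [strip_acronym_tail(s) for s in text]
print(cleaned)  # ['Spain', 'Netherlands', 'United States', 'Russia']
```

Two things to keep in mind with this pattern: it does not require a bracket at all, so a bare two- or three-letter uppercase word in the middle of a string (for example 'Visit USA today') is removed together with everything after it, and because of the trailing .+ an acronym at the very end of the string with nothing after it (for example 'Russia RU') is left untouched.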

Upvotes: 2

The fourth bird

Reputation: 163517

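You could match two or more uppercase chars that have a parenthesis on at least one side (either an opening one before them or a closing one after them), together with any leading whitespace and the rest of the line.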

\s*(?:\([A-Z]{2,}|[A-Z]{2,}\)).*

For example

import re

text = ['Spain (ES)', 'Netherlands (NL .', 'United States (USA.', 'Russia RU)']

for string in text:
    cleaned_acronyms = re.sub(r'\s*(?:\([A-Z]{2,}|[A-Z]{2,}\)).*', '', string)
    print(cleaned_acronyms)

Output

Spain
Netherlands
United States
Russia
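
To illustrate why the pattern insists on at least one parenthesis, here is a small sketch that runs it on two extra inputs that are not in the question ('Visit USA today' and 'Russia RU' are made up for this check, and remove_acronym is a hypothetical helper name). Uppercase words without any bracket are left alone.

```python
import re

# Same pattern as above, compiled once: optional whitespace, then either
# "(" + 2 or more uppercase letters, or 2 or more uppercase letters + ")",
# then the rest of the line.
BRACKETED_ACRONYM = re.compile(r'\s*(?:\([A-Z]{2,}|[A-Z]{2,}\)).*')

def remove_acronym(s):
    # Hypothetical helper wrapping the substitution.
    return BRACKETED_ACRONYM.sub('', s)

# The last two samples are assumptions added for illustration.
samples = ['Spain (ES)', 'Russia RU)', 'Visit USA today', 'Russia RU']

for s in samples:
    print(repr(remove_acronym(s)))
# 'Spain'
# 'Russia'
# 'Visit USA today'
# 'Russia RU'
```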

Upvotes: 2
