Regex to extract acronyms

Question

I am using a regex to extract acronyms (only specific types) from text in Python:

ABC (all caps within round brackets or square brackets or between word endings)
A.B.C (same as above but having only one '.' in between)
A&B&C (same as above but having only one '&' in between)

So far I am using

import re

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall(r'\b[A-Z][A-Z.&]{2,7}\b', text)  # raw string, so \b is a word boundary, not a backspace

The output is: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include (IA).

I am aware of the below links but I am unable to use them correctly. Kindly help.

Extract acronyms patterns from string using regex

Finding Acronyms Using Regex In Python

RegEx to match acronyms

Wiktor Stribiżew · Accepted Answer

If at most one & or . may appear between the uppercase letters, and the separators can be mixed within one acronym (as in this fake NA&T.O string), you can use

re.findall(r'\b[A-Z](?:[&.]?[A-Z])+\b', text)

See the regex demo. It matches a whole word that starts with a single uppercase letter followed by one or more sequences of an optional & or . and another uppercase letter.
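As a quick check of this first pattern, here is a minimal sketch that runs it on the question's sample text and on the mixed-separator string. The variable names (loose_rx, sample, mixed) are illustrative only and not part of the original answer.

```python
import re

# First pattern from the answer: the separator may differ between letters.
loose_rx = r'\b[A-Z](?:[&.]?[A-Z])+\b'

sample = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
mixed = "NA&T.O"

# The pattern has no capturing groups, so findall returns whole matches.
sample_matches = re.findall(loose_rx, sample)
mixed_matches = re.findall(loose_rx, mixed)

print(sample_matches)  # ['STEVE', 'I.A', 'IA', 'B&W']
print(mixed_matches)   # ['NA&T.O']
```

B&&W and I...A are skipped because at most one separator is allowed between two uppercase letters.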

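If the separator must be the same throughout one acronym (so that a mixed form like A.T&W is not matched as a whole), I would suggest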

[x.group() for x in re.finditer(r'\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b', text)]

See the regex demo.

Pattern details

\b - word boundary
[A-Z] - an uppercase letter
(?=([&.]?)) - a positive lookahead that contains a capturing group that captures into Group 1 an optional & or . char
(?:\1[A-Z])+ - one or more occurrences of
- \1 - same char captured into Group 1 (so, you won't get A.T&W)
- [A-Z] - an uppercase letter
\b - word boundary.
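To see what the \1 backreference does in practice, the following sketch applies the pattern to three made-up strings (they are assumptions for illustration, not from the question). Note that a mixed string is not rejected outright: the match simply stops where the separator changes.

```python
import re

strict_rx = r'\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b'

def full_matches(text):
    """Return whole matches (not the contents of Group 1)."""
    return [m.group() for m in re.finditer(strict_rx, text)]

dots = full_matches("A.T.W")        # consistent '.' separator
ampersands = full_matches("A&T&W")  # consistent '&' separator
mixed = full_matches("A.T&W")       # separator changes after 'T'

print(dots)        # ['A.T.W']
print(ampersands)  # ['A&T&W']
print(mixed)       # ['A.T']
```

In the mixed case Group 1 captures '.', so '&W' cannot continue the match, and the word boundary between T and & lets the shorter A.T succeed.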

Python demo:

import re
rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print( [x.group() for x in re.finditer(rx, s)] )
# => ['STEVE', 'I.A', 'IA', 'B&W']
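The demo uses re.finditer rather than re.findall because the pattern contains a capturing group: when a pattern has exactly one group, re.findall returns that group's text instead of the whole match. A short sketch of the difference, reusing rx and s from the demo (the names group_values and whole_matches are illustrative):

```python
import re

rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"

# findall returns the contents of Group 1 (the captured separator) per match.
group_values = re.findall(rx, s)

# finditer yields match objects, so .group() gives the whole match.
whole_matches = [m.group() for m in re.finditer(rx, s)]

print(group_values)   # ['', '.', '', '&']
print(whole_matches)  # ['STEVE', 'I.A', 'IA', 'B&W']
```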
