Prince

Reputation: 186

Regex to extract acronyms

I am using regex to extract acronyms (only specific types) from text in Python.

So far I am using

import re

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)

Output is: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include the IA in (IA).

I am aware of the below links but I am unable to use them correctly. Kindly help.

Extract acronyms patterns from string using regex

Finding Acronyms Using Regex In Python

RegEx to match acronyms

Upvotes: 6

Views: 4725

Answers (2)

Wiktor Stribiżew

Reputation: 627263

If only a &, a ., or nothing may appear between the uppercase letters, and the separators may be mixed within one acronym (as in the made-up string NA&T.O), you can use

re.findall(r'\b[A-Z](?:[&.]?[A-Z])+\b', text)

See the regex demo. It matches a whole word that starts with a single uppercase letter and then has one or more sequences of an optional & or . followed by another uppercase letter.
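
As a quick check of the pattern above, here is a minimal sketch. The sample string is made up for illustration and is not taken from the question:

```python
import re

# Pattern from above: an uppercase letter, then one or more groups of
# an optional '&' or '.' followed by another uppercase letter.
mixed = re.compile(r'\b[A-Z](?:[&.]?[A-Z])+\b')

# Hypothetical sample text with mixed and doubled separators.
sample = "NA&T.O and B&W but not B&&W"

found = mixed.findall(sample)
print(found)  # ['NA&T.O', 'B&W']
```

The mixed-separator NA&T.O is accepted, while B&&W is rejected because at most one separator may sit between two capitals.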

For your case, where the separator should stay the same throughout an acronym, I would suggest

[x.group() for x in re.finditer(r'\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b', text)]

See the regex demo.

Pattern details

  • \b - word boundary
  • [A-Z] - an uppercase letter
  • (?=([&.]?)) - a positive lookahead that contains a capturing group that captures into Group 1 an optional & or . char
  • (?:\1[A-Z])+ - one or more occurrences of
    • \1 - same char captured into Group 1 (so, you won't get A.T&W)
    • [A-Z] - an uppercase letter
  • \b - word boundary.

Python demo:

import re
rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print([x.group() for x in re.finditer(rx, s)])
# => ['STEVE', 'I.A', 'IA', 'B&W']

Upvotes: 6

SQB

Reputation: 4078

What you want is a capital followed by a bunch of capitals, with optional dots or ampersands in between.

re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)

Breaking it down:

  • All backslashes are doubled because they need escaping in a regular (non-raw) Python string literal; a raw string such as r'\b' avoids this
  • \b word border
  • [A-Z] capital
  • (?: opening a non-capturing group
  • [\.&] character class containing . and & (the backslash before . is optional inside a character class)
  • ? optional
  • [A-Z] followed by another capital
  • ) closing non-capturing group of an optional . or &, followed by a capital
  • {1,7} repeating that group 1 to 7 times, so a match contains 2 to 8 capitals
  • \b word border

We want a non-capturing group because re.findall returns the contents of the groups, rather than the whole match, when the pattern contains any capturing groups.
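
To illustrate the difference, here is a small sketch comparing the two group types. The shortened sample text is an assumption, and the single backslashes are possible because raw strings are used:

```python
import re

# Shortened, hypothetical variant of the question's sample text.
text = "My friend works at (I.A.). Indian Army(IA). B&W"

# With a capturing group, findall returns only what the group captured
# on its last repetition, not the whole match.
capturing = re.findall(r'\b[A-Z]([.&]?[A-Z]){1,7}\b', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b', text)

print(capturing)      # ['.A', 'A', '&W']
print(non_capturing)  # ['I.A', 'IA', 'B&W']
```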

Note that [A-Z] only matches ASCII capitals; there are better ways of matching capitals that work across all of the Unicode characters.

This does match B&WW and B&W.W, since we do not enforce the use of the (same) character every time. If you want that, the expression gets a bit more complex (though not much).
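
A rough sketch of that difference follows. The stricter pattern is only one possible way to enforce a consistent separator (a lookahead plus a backreference), and the candidate strings are made up for illustration:

```python
import re

# Pattern from this answer: any mix of '.' and '&' is allowed between capitals.
loose = re.compile(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b')

# Assumed stricter variant: the lookahead captures the first separator
# (possibly empty) and \1 requires that same separator before every capital.
strict = re.compile(r'\b[A-Z](?=([.&]?))(?:\1[A-Z]){1,7}\b')

# Hypothetical candidates, each tested as a whole string with fullmatch.
candidates = ["B&W", "B&WW", "B&W.W", "B.W.W"]

loose_ok = [c for c in candidates if loose.fullmatch(c)]
strict_ok = [c for c in candidates if strict.fullmatch(c)]

print(loose_ok)   # ['B&W', 'B&WW', 'B&W.W', 'B.W.W']
print(strict_ok)  # ['B&W', 'B.W.W']
```

fullmatch is used here so that each candidate is accepted or rejected as a whole; with findall on running text, a rejected candidate can still yield a shorter partial match.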

Upvotes: 9
