Reputation: 186
I am using regex to extract acronyms (only specific types) from text in Python.
So far I am using
text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)
Output is : ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include (IA).
I am aware of the below links but I am unable to use them correctly. Kindly help.
Extract acronyms patterns from string using regex
Finding Acronyms Using Regex In Python
Upvotes: 6
Views: 4725
Reputation: 627263
If only a &, a ., or nothing may appear between the uppercase letters, and the separators may be mixed within one acronym (as in this fake NA&T.O string), you can use
re.findall(r'\b[A-Z](?:[&.]?[A-Z])+\b', text)
See the regex demo. It matches a whole word that starts with a single uppercase letter and then has one or more sequences of an optional & or . followed by another uppercase letter.
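As a quick check of that behaviour, the sketch below runs this pattern over the question's sample text and over the fake NA&T.O string. The variable names are only illustrative.

```python
import re

# Pattern from above: an uppercase letter, then one or more
# "optional & or ., uppercase letter" sequences.
LOOSE = re.compile(r'\b[A-Z](?:[&.]?[A-Z])+\b')

text = ("My name is STEVE. My friend works at (I.A.). "
        "Indian Army(IA). B&W also B&&W Also I...A")

loose_matches = LOOSE.findall(text)
print(loose_matches)   # ['STEVE', 'I.A', 'IA', 'B&W']

# Separators may be mixed freely inside one match.
mixed_matches = LOOSE.findall("this fake NA&T.O string")
print(mixed_matches)   # ['NA&T.O']
```

Because the group is non-capturing, re.findall returns the whole match here.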
Here, since the separator should stay the same throughout an acronym, I would suggest
[x.group() for x in re.finditer(r'\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b', text)]
See the regex demo.
Pattern details
\b - word boundary
[A-Z] - an uppercase letter
(?=([&.]?)) - a positive lookahead that contains a capturing group that captures into Group 1 an optional & or . char
(?:\1[A-Z])+ - one or more occurrences of
    \1 - same char captured into Group 1 (so, you won't get A.T&W)
    [A-Z] - an uppercase letter
\b - word boundary.

import re
rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print( [x.group() for x in re.finditer(rx, s)] )
# => ['STEVE', 'I.A', 'IA', 'B&W']
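To see what the backreference changes in practice, here is a small comparison on strings with mixed separators. The helper name find_consistent is made up for this illustration. Note that a mixed string is not rejected outright; it is split into its consistent parts.

```python
import re

rx = re.compile(r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b")

def find_consistent(s):
    # finditer + group() returns the whole match; findall would
    # return only the contents of capturing Group 1.
    return [m.group() for m in rx.finditer(s)]

print(find_consistent("A.T&W"))    # ['A.T']
print(find_consistent("NA&T.O"))   # ['NA', 'T.O']
print(rx.findall("I.A and B&W"))   # ['.', '&']  (Group 1 only)
```

If partial matches such as A.T are unwanted, they would need extra filtering; that step is not covered by the pattern itself.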
Upvotes: 6
Reputation: 4078
What you want is a capital followed by a bunch of capitals, with optional dots or ampersands in between.
re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)
Breaking it down:
\b word border
[A-Z] capital
(?: opening a non-capturing group
[\.&] character class containing . and &
? optional
[A-Z] followed by another capital
) closing non-capturing group of an optional . or &, followed by a capital
{1,7} repeating that group 1 - 7 times
\b word border

We want a non-capturing group since re.findall returns groups (if present).
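The point about re.findall and groups can be seen by swapping the non-capturing group for a capturing one. A minimal sketch using the question's sample text (the dot is left unescaped inside the character class, which is equivalent):

```python
import re

text = ("My name is STEVE. My friend works at (I.A.). "
        "Indian Army(IA). B&W also B&&W Also I...A")

# Non-capturing group: findall returns the full matches.
full = re.findall(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b', text)
print(full)      # ['STEVE', 'I.A', 'IA', 'B&W']

# Capturing group: findall returns only what the group captured
# on its last repetition, not the whole acronym.
partial = re.findall(r'\b[A-Z]([.&]?[A-Z]){1,7}\b', text)
print(partial)   # ['E', '.A', 'A', '&W']
```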
There are better ways of matching capitals that work across all of the Unicode characters.
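One stdlib-only way to do this, as a rough sketch: Python's built-in re module has no \p{Lu} class (the third-party regex module does), so the code below builds a character class from every code point whose Unicode category is Lu. The names UPPER and ACRONYM are invented for this example, and scanning all code points once at start-up is an assumption about acceptable cost, not a recommendation.

```python
import re
import sys
import unicodedata

# Every code point in the "Letter, uppercase" (Lu) category.
UPPER = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Lu'
)

# Same shape as the pattern above, with [A-Z] swapped for the Lu class.
ACRONYM = re.compile(rf'\b[{UPPER}](?:[.&]?[{UPPER}]){{1,7}}\b')

sample = "ÉCOLE, Ü.N and NATO but not Straße"
unicode_matches = ACRONYM.findall(sample)
print(unicode_matches)   # ['ÉCOLE', 'Ü.N', 'NATO']
```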
This does match B&WW and B&W.W, since we do not enforce the use of the (same) character every time. If you want that, the expression gets a bit more complex (though not much).
Upvotes: 9