Reputation: 315
I need to write a function that finds uppercase acronyms, including some that contain numbers, but I can only detect the ones made up entirely of letters.
An example:
s= "the EU needs to contribute part of their GDP to improve the IC3 plan"
I tried
import re

def acronym(s):
    return re.findall(r"\b[A-Z]{2,}\b", s)

print(acronym(s))
but I only get
['EU', 'GDP']
What can I add or change to get
['EU', 'GDP', 'IC3']
Thanks.
Upvotes: 0
Views: 402
Reputation: 306
Try this.
It's similar to both Andrej's and S. Pellegrino's answers; however, it won't capture number-only strings like '123', and it will capture strings with a digit at any position rather than just at the end.
Explanation of pattern:
\b - Match a word boundary (the start of a word).
(?=\d*[A-Z]) - Assert that what follows is zero or more digits and then an uppercase letter (i.e. the word contains at least one uppercase letter). This is called a positive lookahead. It uses \d* rather than .* so that it inspects only the current word; with .* an uppercase letter anywhere later in the string would satisfy it.
[A-Z\d]{2,} - Match an uppercase letter or a digit two or more times.
\b - Match another word boundary (the end of the word).
import re
def acronym(s):
    pattern = r'\b(?=\d*[A-Z])[A-Z\d]{2,}\b'
    return re.findall(pattern, s)
Edit: add explanation of regex pattern.
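A small check of how the lookahead's scope affects number-only words. The sample text and variable names here are made up for illustration; the two patterns differ only in whether the lookahead scans the rest of the string (`.*`) or just the leading digits of the current word (`\d*`).

```python
import re

# Hypothetical sample text with a number-only word before an acronym
text = "the EU spent 123 on the IC3 plan"

# Lookahead that scans the rest of the string: it succeeds at "123"
# because an uppercase letter (the "I" in "IC3") appears later in the text.
whole_string = re.findall(r"\b(?=.*[A-Z])[A-Z\d]{2,}\b", text)

# Lookahead limited to the current word: optional digits, then an uppercase letter.
current_word = re.findall(r"\b(?=\d*[A-Z])[A-Z\d]{2,}\b", text)

print(whole_string)   # ['EU', '123', 'IC3']
print(current_word)   # ['EU', 'IC3']
```

Restricting the lookahead to the word itself is what keeps '123' out of the result regardless of what follows it.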
Upvotes: 0
Reputation: 638
Try:
import re
def acronym(s):
    # One group around both alternatives, so \b applies on both sides of either one
    return re.findall(r"\b(?:[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+)\b", s)
print(acronym('3I 33 I3 A GDP W3C'))
output:
['3I', 'I3', 'GDP', 'W3C']
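This regex means:
Find any word (delimited by \b, which matches a "word boundary") that either
- starts with one or more digits followed by an uppercase letter, then any number of uppercase letters or digits, or
- starts with an uppercase letter followed by one or more uppercase letters or digits.
The ?: makes a group non-capturing, so re.findall returns each whole match instead of the contents of separate groups.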
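To see what the non-capturing groups buy, here is a rough comparison. The capturing variant is hypothetical and written only for contrast; with more than one capturing group, `re.findall` returns a tuple per match rather than the matched text.

```python
import re

text = "3I 33 I3 A GDP W3C"

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the group that did not take part in the match.
captured = re.findall(r"\b(?:([0-9]+[A-Z][A-Z0-9]*)|([A-Z][A-Z0-9]+))\b", text)

# Non-capturing group only: findall returns the matched text itself.
plain = re.findall(r"\b(?:[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+)\b", text)

print(captured)  # [('3I', ''), ('', 'I3'), ('', 'GDP'), ('', 'W3C')]
print(plain)     # ['3I', 'I3', 'GDP', 'W3C']
```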
Upvotes: 2
Reputation: 195553
This regex won't match numbers (e.g. 123):
import re
s = "the EU needs to contribute part of their GDP to improve the IC3 plan"
def acronym(s):
    return re.findall(r"\b([A-Z]{2,}\d*)\b", s)
print(acronym(s))
Prints:
['EU', 'GDP', 'IC3']
Regex101 link here.
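One thing worth checking with this pattern is where the digits sit. Because it requires at least two uppercase letters first and allows digits only at the end, acronyms with a digit in the middle or at the start are skipped. A quick sketch on a made-up input:

```python
import re

pattern = r"\b([A-Z]{2,}\d*)\b"

# Hypothetical test string mixing digit positions
result = re.findall(pattern, "W3C IC3 3I 123 EU")

print(result)  # ['IC3', 'EU']
```

If inputs like 'W3C' or '3I' need to match as well, a character class that mixes letters and digits, as in the other answers, may be a better fit.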
Upvotes: 0