Reputation: 526
I want to find so-called acronyms in text. Is this the correct way to define the regex for it? My idea is that if something starts with a capital letter and ends with a capital letter, it is an acronym. Is this correct?
import re

# The string is split across two literals so it stays a valid single-line string.
test_string = ("Department of Something is called DOS, "
               "or DoS, or (DiS) or D.O.S. in United State of America, U.S.A./ USA")
pattern3 = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
print(re.findall(pattern3, test_string))
and the output is:
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']
Upvotes: 0
Views: 2187
Reputation: 5588
I think you can use the word boundary anchor \b for what you want to do:
>>> regex = r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?"
>>> re.findall(regex, "AbIA AoP U.S.A.")
['AbIA', 'AoP', 'U.S.A.']
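To show where the \b anchors make a difference, here is a small sketch that runs the pattern from the question and the pattern above on the same input. The sample sentence and the variable names are made up for illustration; only the two regexes come from this thread.

```python
import re

# Pattern from the question (no word boundaries).
unanchored = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
# Pattern from this answer (anchored with \b).
anchored = r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?"

# Hypothetical sample text, chosen to include a mixed-case surname.
sample = "McDonald works at NASA in the U.S.A."

unanchored_hits = re.findall(unanchored, sample)
anchored_hits = re.findall(anchored, sample)

print(unanchored_hits)  # ['McD', 'NASA', 'U.S.A.']
print(anchored_hits)    # ['NASA', 'U.S.A.']
```

Without the anchors, the first pattern picks up "McD" from inside "McDonald", because nothing requires the match to end at the end of a word. With \b, the final capital has to sit at a word boundary, so that partial match is rejected. Words like "DoS" or "AoP" still match either way, so whether that counts as an acronym is up to you.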
Upvotes: 2