Reputation: 315

How to find acronyms containing numbers in a string

I need to write a function that finds uppercase acronyms, including ones that contain digits, but so far I can only detect the ones made up of letters only.

An example:

s = "the EU needs to contribute part of their GDP to improve the IC3 plan"

I tried

import re

def acronym(s):
    return re.findall(r"\b[A-Z]{2,}\b", s)

print(acronym(s))

but I only get

['EU', 'GDP']

What can I add or change to get

['EU', 'GDP', 'IC3']

Thanks.

Upvotes: 0

Answers (3)

Marcus Caisey

Reputation: 306

Try this.

It's similar to both Andrej's and S. Pellegrino's answers; however, it won't capture number-only strings like '123', and it will capture strings with a digit at any position rather than just at the end.

Explanation of pattern:

\b - Match a word boundary (here, the start of a word).

(?=\d*[A-Z]) - Assert that what follows is zero or more digits and then an uppercase letter (i.e. the word contains at least one uppercase letter). This is called a positive lookahead. Using \d* rather than .* keeps the check inside the current word; with .*, an uppercase letter anywhere later in the string would satisfy it, and a number such as '123' would then be matched.

[A-Z\d]{2,} - Match an uppercase letter or a digit two or more times.

\b - Match another word boundary (the end of the word).

import re

def acronym(s):
    pattern = r'\b(?=\d*[A-Z])[A-Z\d]{2,}\b'
    return re.findall(pattern, s)
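
As a quick sanity check of how much the lookahead's scope matters, here is a minimal sketch comparing a lookahead limited to the current word (\d*) with one that scans the rest of the string (.*). The names SCOPED, UNSCOPED, acronym_scoped and acronym_unscoped, as well as the sample sentence, are made up for illustration.

```python
import re

# Lookahead limited to the current word: optional digits, then an uppercase letter.
SCOPED = re.compile(r'\b(?=\d*[A-Z])[A-Z\d]{2,}\b')
# Lookahead that accepts an uppercase letter anywhere later in the string.
UNSCOPED = re.compile(r'\b(?=.*[A-Z])[A-Z\d]{2,}\b')

def acronym_scoped(s):
    return SCOPED.findall(s)

def acronym_unscoped(s):
    return UNSCOPED.findall(s)

sample = "call 123 about the IC3 plan and the 3I fund"
print(acronym_scoped(sample))    # ['IC3', '3I']
print(acronym_unscoped(sample))  # ['123', 'IC3', '3I']
```

The unscoped variant picks up '123' only because an uppercase letter happens to appear later in the sentence; on a string with no uppercase letters after the number, both variants would skip it.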

Edit: add explanation of regex pattern.

Upvotes: 0

S. Pellegrino

Reputation: 638

Try:

import re

def acronym(s):
    # Both alternatives sit inside one group so that each \b applies to both of them.
    return re.findall(r"\b(?:[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+)\b", s)

print(acronym('3I 33 I3 A GDP W3C'))

output:

['3I', 'I3', 'GDP', 'W3C']

This regex means:

Find any word (delimited by \b, which matches a word boundary) that either

starts with one or more digits, followed by at least one capital letter, followed by any number of capital letters and digits, or
starts with a capital letter followed by at least one more capital letter or digit.

The (?:...) syntax makes a group non-capturing. With capturing groups, re.findall would return a tuple of groups for each match instead of the matched string.
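
To illustrate the difference, here is a small sketch that runs the same alternation once with a non-capturing group and once with two capturing groups. The variable names are illustrative, and the pattern is written with both alternatives wrapped in a single group, which is an assumption made so that each \b applies to both branches.

```python
import re

text = '3I 33 I3 A GDP W3C'

# Non-capturing group: findall returns each whole match as a string.
non_capturing = re.findall(r"\b(?:[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+)\b", text)
print(non_capturing)  # ['3I', 'I3', 'GDP', 'W3C']

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the alternative that did not take part in the match.
capturing = re.findall(r"\b(?:([0-9]+[A-Z][A-Z0-9]*)|([A-Z][A-Z0-9]+))\b", text)
print(capturing)  # [('3I', ''), ('', 'I3'), ('', 'GDP'), ('', 'W3C')]
```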

Upvotes: 2

Andrej Kesely

Reputation: 195553

This regex won't match pure numbers (e.g. 123); it matches two or more uppercase letters optionally followed by digits:

import re

s = "the EU needs to contribute part of their GDP to improve the IC3 plan"

def acronym(s):
    return re.findall(r"\b([A-Z]{2,}\d*)\b", s)

print(acronym(s))

Prints:

['EU', 'GDP', 'IC3']
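
For a sense of where this pattern stops matching, here is a short sketch that runs it on a second input as well. The function name acronym_trailing_digits and the second sample string are chosen for illustration only.

```python
import re

def acronym_trailing_digits(s):
    # Two or more uppercase letters, then optional digits at the end only.
    return re.findall(r"\b([A-Z]{2,}\d*)\b", s)

print(acronym_trailing_digits(
    "the EU needs to contribute part of their GDP to improve the IC3 plan"
))  # ['EU', 'GDP', 'IC3']

# Tokens with a leading or embedded digit, or with a single letter, are skipped.
print(acronym_trailing_digits('3I 33 I3 A GDP W3C'))  # ['GDP']
```

This may be enough when digits only ever appear as a suffix after at least two letters; acronyms such as 'W3C' or '3I' need a pattern that allows digits in other positions.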


Upvotes: 0
