Reputation: 17
I am trying to use re.findall to get all of the capitalized words and abbreviations in a string. I have regular expressions that find each individually, but when I combine the two, I get back tuples containing an empty string and the item that I wanted to find.
Here is my regular expression that does not seem to work. I imagine it's a quick fix I am just unaware of:
x = re.findall(r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt) #just has extra "" in each set
edit:
I am currently using this as my test case:
"USA. U.S.A America."
This is my output:
[('USA.', ''), ('', 'U.S.A'), ('America.', '')]
Upvotes: 1
Views: 102
Reputation: 10709
Use (?:...) to make a group non-capturing, as documented.
Here is a simplified version that combines the two searches (capitalized words and acronyms). We don't capture each search individually; each one is wrapped in a non-capturing (?:...) instead. What we capture is the result of both together, e.g. ( (?:...) | (?:...) ), where the first (?:...) is the capital letter search and the second (?:...) is the acronym search.
import re
txt = "USA. U.S.A America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun "
matches = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w\.]*))", txt)
print(matches)
['USA', 'U.S.A', 'America', 'u.s.a', 'Mars', 'A.b', 'c.D.e.', 'nep.tune.', 'f.g.h.i', 'Sun']
Upvotes: 0
Reputation: 6930
In your regular expression, you have two sets of capturing (...), one for each alternative, so re.findall() returns a tuple of them. This is useful if you need to match several parts of a string, or if you need to know which alternative was the one that matched.
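To make the effect of the groups concrete, here is a minimal sketch of how the return value of re.findall() changes with the number of capturing groups. The sample text and the cat/dog patterns are made up for illustration and are not from the question.

```python
import re

text = "cat dog"  # illustrative input, not the asker's data

# Two capturing groups: each result is a tuple with one slot per group,
# and the slot for the alternative that did not match is an empty string.
two_groups = re.findall(r"(cat)|(dog)", text)
print(two_groups)  # [('cat', ''), ('', 'dog')]

# One capturing group around the whole alternation: a flat list of that group.
one_group = re.findall(r"((?:cat)|(?:dog))", text)
print(one_group)  # ['cat', 'dog']

# No capturing groups at all: a flat list of whole matches.
no_groups = re.findall(r"(?:cat)|(?:dog)", text)
print(no_groups)  # ['cat', 'dog']
```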
In order to get just a flat list, you'll need to either omit those or turn them into non-capturing (?:...):
x = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)
or, if the (...) were significant (or you want them for clarity):
x = re.findall(r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)
Either of these returns the value: ['USA.', 'U.S.A', 'America.']
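If you do want to know which alternative matched, one option is to keep the groups but name them and iterate with re.finditer() instead. This is only a sketch: the group names word and abbr are my own labels, not something from the question.

```python
import re

txt = "USA. U.S.A America."

# Same two alternatives as in the question, but as named capturing groups.
# The names "word" and "abbr" are hypothetical labels chosen for this example.
pattern = re.compile(
    r"(?P<word>[A-Z][A-Za-z]+\.?)"
    r"|(?P<abbr>\b[A-Z](?:[.&]?[A-Z]){2,}\b)"
)

# m.lastgroup is the name of the last capturing group that matched,
# which here identifies the alternative; m.group() is the full match.
labelled = [(m.lastgroup, m.group()) for m in pattern.finditer(txt)]
print(labelled)
# [('word', 'USA.'), ('abbr', 'U.S.A'), ('word', 'America.')]
```

Note that 'USA.' is labelled word rather than abbr, because the first alternative is tried first and already matches it.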
Upvotes: 1