vynabhnnqwxleicntw

Reputation: 17

Issue with regular expressions returning an extra empty string

I am trying to use re.findall to get all of the capitalized words and abbreviations. I have figured out a regular expression for each case individually, but when I combine the two, I get back tuples containing an empty string alongside the item that I wanted to find.

Here is the regular expression that does not work. I imagine it's a quick fix that I am just unaware of:

x = re.findall(r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)  # each tuple has an extra ""

edit:

I am currently using this as my test case:

"USA. U.S.A America."

This is my output:

[('USA.', ''), ('', 'U.S.A'), ('America.', '')]

Upvotes: 1

Views: 102

Answers (2)

Niel Godfrey P. Ponciano

Reputation: 10709

Use a non-capturing group, (?:...), for any group you do not want captured, as documented.

Here is a simplified regex that combines the following two searches:

  • Any word that starts with a capital letter
  • Any word that is an abbreviation/acronym marked by a separator dot (.)

Rather than capturing each search separately, we make each one non-capturing with (?:...) and wrap the whole alternation in a single capturing group, e.g. ((?:...)|(?:...)), where the first (?:...) is the capital-letter search and the second (?:...) is the acronym search. With exactly one capturing group, re.findall returns a flat list of strings.

import re

txt = "USA. U.S.A   America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun  "
matches = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w\.]*))", txt)
print(matches)
['USA', 'U.S.A', 'America', 'u.s.a', 'Mars', 'A.b', 'c.D.e.', 'nep.tune.', 'f.g.h.i', 'Sun']

Upvotes: 0

Jiří Baum

Reputation: 6930

In your regular expression, you have two capturing groups (...), one for each alternative, so re.findall() returns a list of tuples with one element per group; the group for the alternative that did not match comes back as an empty string. This is useful if you need to match several parts of a string, or if you need to know which alternative was the one that matched.
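As a rough illustration of how the return shape of re.findall() depends on the number of capturing groups, here is a minimal sketch. The three small patterns are made up for this demonstration and are not the patterns from the question; only the test string is taken from it.

```python
import re

txt = "USA. U.S.A America."

# No capturing groups: each result is the whole match, as a string.
no_groups = re.findall(r"[A-Z]+", txt)

# Exactly one capturing group: each result is that group's text, as a string.
one_group = re.findall(r"([A-Z])[a-z]+", txt)

# Two capturing groups: each result is a tuple with one slot per group.
# The group that did not take part in the match is reported as "".
two_groups = re.findall(r"([A-Z]{2,})|([A-Z][a-z]+)", txt)

print(no_groups)   # ['USA', 'U', 'S', 'A', 'A']
print(one_group)   # ['A']
print(two_groups)  # [('USA', ''), ('', 'America')]
```

The position of the non-empty slot in each tuple tells you which alternative matched, which is the information the original pattern was returning.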

In order to get just a flat list, you'll need to either omit those or turn them into non-capturing (?:...):

x = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)

or, if the (...) were significant (or you want them for clarity):

x = re.findall(r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)

Either of these returns the value: ['USA.', 'U.S.A', 'America.']
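If you would rather keep the capturing groups, for example to know which alternative matched, one possible alternative (not part of the approaches above) is re.finditer(), reading the whole match with group(0). This is only a sketch; the variable names flat and which_alternative are invented for the example.

```python
import re

txt = "USA. U.S.A America."

# Same pattern as in the question, with both capturing groups kept.
pattern = re.compile(r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[.&]?[A-Z]){2,}\b)")

flat = []               # whole matches, regardless of which group matched
which_alternative = []  # 1 = capitalized word, 2 = abbreviation

for m in pattern.finditer(txt):
    flat.append(m.group(0))                # group(0) is always the full match
    which_alternative.append(m.lastindex)  # index of the last group that matched

print(flat)               # ['USA.', 'U.S.A', 'America.']
print(which_alternative)  # [1, 2, 1]
```

This gives the same flat list as the two findall() versions while still recording which branch produced each item.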

Upvotes: 1
