Use of findall and parentheses in Python

Question

I need to extract all letters after the + sign or at the beginning of a string like this:

formula = "X+BC+DAF"

I tried the following, but I do not want to see the + sign in the result. I want to see only ['X', 'B', 'D'].

>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']

When I grouped with parentheses, I got this strange result:

>>> re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]

Why did it create tuples when I tried to group? How can I write the regexp directly so that it returns ['X', 'B', 'D']?

Mark Byers · Accepted Answer

If there are any capturing groups in the regular expression, re.findall returns only the values captured by the groups. If there are no groups, the entire matched string is returned. From the documentation:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
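To make the three cases in that description concrete, here is a small sketch that runs the patterns from this question side by side. The variable names are only illustrative.

```python
import re

formula = "X+BC+DAF"

# No groups: each item is the whole match, so the "+" is included.
no_groups = re.findall(r"^[A-Z]|[+][A-Z]", formula)

# Exactly one capturing group: each item is that group's text only.
one_group = re.findall(r"(?:^|[+])([A-Z])", formula)

# Two capturing groups: each item is a tuple with one slot per group.
# A group that did not take part in a match contributes "".
two_groups = re.findall(r"^([A-Z])|[+]([A-Z])", formula)

# The tuples can be collapsed by keeping whichever slot is non-empty.
flattened = [a or b for a, b in two_groups]

print(no_groups)   # ['X', '+B', '+D']
print(one_group)   # ['X', 'B', 'D']
print(two_groups)  # [('X', ''), ('', 'B'), ('', 'D')]
print(flattened)   # ['X', 'B', 'D']
```

So the tuples in the question are not an error: they appear because the pattern contains two capturing groups, and only one of them matches on each alternative.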

How to write the regexp directly such that it returns ['X', 'B', 'D'] ?

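Use a non-capturing group, (?:...), for the "start of string or +" alternation and keep a single capturing group around the letter. With only one capturing group, findall returns a flat list of strings: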

>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']

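Or, for this specific case, you could try a simpler solution using a word boundary. \b matches at the start of the string and between + and a letter, because + is not a word character: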

>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
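One caveat with the word boundary, shown as a sketch: \b does not care which non-word character precedes the letter. The input below, with a "-" in it, is a hypothetical example and not from the question.

```python
import re

# Hypothetical input that uses "-" as well as "+" as a separator.
mixed = "X+BC-DAF"

# \b matches at every transition between a non-word and a word character,
# so the letter after "-" is picked up too.
by_boundary = re.findall(r"\b[A-Z]", mixed)

# The explicit pattern only accepts the start of the string or a "+".
by_plus = re.findall(r"(?:^|\+)([A-Z])", mixed)

print(by_boundary)  # ['X', 'B', 'D']
print(by_plus)      # ['X', 'B']
```

If the formula can only ever contain letters and +, the two patterns behave the same and the shorter one is fine.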

Or a solution using str.split that doesn't use regular expressions:

>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
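Note that s[0] raises IndexError if any piece is empty, which happens when the string has a doubled, leading, or trailing +. A minimal sketch that skips empty pieces, assuming that is the behaviour you want (first_letters is a hypothetical helper name):

```python
def first_letters(formula):
    """Return the first character of each non-empty '+'-separated term."""
    # "if term" drops the empty strings that split() produces
    # for doubled, leading, or trailing "+" signs.
    return [term[0] for term in formula.split('+') if term]

print(first_letters("X+BC+DAF"))  # ['X', 'B', 'D']
print(first_letters("X++B+"))     # ['X', 'B']
```

Unlike the regular expressions above, this version does not check that the first character is an uppercase letter.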
