Rashad

Reputation: 245

Python - Backreferencing a Named Group

I am having trouble understanding how to use a named backreference in Python. I want to find all references to the months January through March, including their abbreviated forms (e.g. January, Jan., February, Feb., etc.).

import re

str = 'Bob Martin brought a car on January 20, 1962. On Feb. the 23rd, Bob sold his car. The 21st of March will be fun.'

re.findall(r'''
       (?P<Month> (Jan(uary|\.)) | (Feb(ruary|\.)) | (Mar(ch|\.))) # Months
     | (?P=Month)\sthe\s\d{2}(rd|st)
     | [Tt]he\s\d{2}(rd|st)\sof(?P=Month)
''', str, re.X)

Should match:

January

Feb. the 23rd

The 21st of March

Upvotes: 2

Views: 2583

Answers (1)

BrenBarn

Reputation: 251398

From your example, it looks like you're trying to use groups as a shortcut to avoid writing out one piece of your regex multiple times. That is, you want to write an expression like (?P<expr>this|that)|something then (?P=expr) and have it work as if you had written (this|that)|something then (this|that).

But that's not how groups work. Capturing groups (including named groups) capture what is matched, not the expression itself. In your example, if the input text doesn't contain one of the given month names, then the "Month" group will be empty. If it does contain one of those, then the group will contain the month name, but your pattern won't use it: because the three lines of your regex are alternatives, once the first one matches, the regex never tries the second and third.

The purpose of backreferences is to match the same text occurring multiple times in the input string, not to repeat a part of the regular expression itself. For instance, something like (a|b) is \1 will match "a is a" or "b is b", but not "a is b". This regex is thus not the same as (a|b) is (a|b), which would also match "a is b".
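To make the difference concrete, here is a small sketch comparing the two patterns from the paragraph above. The variable names backref and repeated are just illustrative.

```python
import re

# \1 must match the same text that group 1 captured.
backref = re.compile(r"(a|b) is \1")
# Two independent groups: each may match "a" or "b" on its own.
repeated = re.compile(r"(a|b) is (a|b)")

for s in ["a is a", "b is b", "a is b"]:
    # Print whether each pattern matches the whole string.
    print(s, bool(backref.fullmatch(s)), bool(repeated.fullmatch(s)))
```

Only the second pattern accepts "a is b", since the backreference in the first one insists on a repeat of whatever group 1 matched.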

You can't use backreferences to predefine pieces of a regex. If you want to do that, you'd have to define a separate string and concatenate it into the pattern multiple times. For instance, with my example, you could do letter = r"(a|b)" and then do regex = letter + " is " + letter to get (a|b) is (a|b).
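Applied to the months in the question, the same concatenation idea might look roughly like the sketch below. The names month and day are my own, and I have made a few assumptions beyond the original pattern: non-capturing groups (so that findall returns whole matches rather than group tuples), a \s between "of" and the month, and a slightly broader day suffix. The longer alternatives are listed first so a bare month name is only used as a fallback.

```python
import re

text = ('Bob Martin brought a car on January 20, 1962. On Feb. the 23rd, '
        'Bob sold his car. The 21st of March will be fun.')

# Reusable pieces, written once and concatenated where needed.
# Non-capturing groups (?:...) keep findall returning full matches.
month = r"(?:Jan(?:uary|\.)|Feb(?:ruary|\.)|Mar(?:ch|\.))"
day = r"\d{1,2}(?:st|nd|rd|th)"  # assumption: allow any ordinal suffix

pattern = re.compile(
    r"[Tt]he\s" + day + r"\sof\s" + month    # e.g. "The 21st of March"
    + "|" + month + r"\sthe\s" + day         # e.g. "Feb. the 23rd"
    + "|" + month                            # a month name on its own
)

matches = pattern.findall(text)
print(matches)
```

On the sample sentence this yields the three strings the question asks for: 'January', 'Feb. the 23rd', and 'The 21st of March'.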

However, doing this can quickly become unwieldy. Regular expressions are not a great tool for representing grammars with lots of mix-and-matchable parts (like the "Month" in your example). For that, you would be better off using a parsing library like parcon.

Upvotes: 4
