Ken Williams

Reputation: 23995

Combine compiled Python regexes

Is there any mechanism in Python for combining compiled regular expressions?

I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:

import re

first = re.compile(r"(hello?\s*)")

# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was 
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)

The result I'm looking for is achievable like so in Perl:

$first = qr{(hello?\s*)};

# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;

$both = qr{$first$second$second};

A test shows the results:

test($second, "...one-two-three...");                   # Matches
test($both, "...hello one-two-THREEone-two-three...");  # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE...");    # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE...");  # No match

sub test {
  my ($pat, $str) = @_;
  print $str =~ $pat ? "Matches\n" : "No match\n";
}

Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?

(Note: one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so there are no collisions of the kind Python complains about when I try to compile the combined expression. I haven't seen that anywhere in the Python world, and I'm not sure whether there's an alternative I haven't thought of.)

Upvotes: 19

Views: 3736

Answers (2)

Pulkit Kansal

Reputation: 149

Ken, this is an interesting problem. I agree that the Perl solution is very slick. I came up with something, but it is not as elegant. Maybe it gives you an idea for exploring a Python solution further. The idea is to simulate the concatenation with Python's re methods: search for the first pattern, then match the second pattern starting exactly where the previous match ended.

import re

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

text = "...hello one-two-THREEone/two/three..."
# text = "...hellone/Two/ThreeONE-TWO-THREE..."

# Find 'first' anywhere, then require 'second' twice,
# each time anchored at the position where the previous match ended.
m1 = first.search(text)
if m1:
    m2 = second.match(text, m1.end())
    if m2:
        m3 = second.match(text, m2.end())
        if m3:
            print('Matches')

It works for most cases, but not for the following one:

...hellone/Two/ThreeONE-TWO-THREE...

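So yes, I admit it is not a complete solution to your problem. Here the first pattern greedily consumes the "o" in "hellone", and because the patterns are applied one at a time, nothing backtracks to hand that "o" back to the second pattern. Hope this helps though.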
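The step-by-step matching above can be wrapped in a small helper. This is only a sketch: match_in_sequence is a hypothetical name, and it assumes at least one pattern is passed in. It retries from every place the first pattern matches, but it still does not backtrack inside a pattern, so the "hellone" case keeps failing.

```python
import re

def match_in_sequence(patterns, text):
    """Return True if the compiled patterns match back to back somewhere in text."""
    first, *rest = patterns
    # Try every (non-overlapping) place where the first pattern matches.
    for start in first.finditer(text):
        pos = start.end()
        for pat in rest:
            # Pattern.match(text, pos) anchors the match at position pos.
            m = pat.match(text, pos)
            if m is None:
                break
            pos = m.end()
        else:
            # Every remaining pattern matched directly after the previous one.
            return True
    return False

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

ok = match_in_sequence([first, second, second], "...hello one-two-THREEone/two/three...")
missed = match_in_sequence([first, second, second], "...hellone/Two/ThreeONE-TWO-THREE...")
```

Each pattern keeps its own flags and its own group names, which is why no collision occurs, but the price is that the pieces no longer behave like one regex.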

Upvotes: 3

Jason Pierrepont

Reputation: 141

I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this

both = re.compile(first.pattern + second.pattern + second.pattern)

tries to create two capture groups named r1, and Python requires group names to be unique within a pattern.

For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?

# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
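A quick way to see how Python treats that ambiguity is to try compiling it. This is a minimal check using only the standard re module; the variable name message is just for illustration.

```python
import re

try:
    re.compile(r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)')
    message = None
except re.error as exc:
    # The duplicate name is rejected at compile time, before any matching happens.
    message = str(exc)

print(message)
```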

Solution 1

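An easy solution is probably to remove the names from the capture groups and to pass re.IGNORECASE when compiling the combined pattern as well. The code below compiles, but it does not match exactly what you want: replacing the back-reference with a second ([-/]) group means mixed separators such as one-two/three are now accepted, and IGNORECASE now applies to the hello part too.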

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)

Solution 2

What I'd probably do instead is define the separate regular expressions as strings; then you can combine them however you'd like.

pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)

Or better yet, for this specific example, don't write pattern2 out twice; wrap it in a group and use a {2} quantifier instead:

both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)

which gives you the following regex:

r'(hello?\s*)(one([-/])two([-/])three){2}'
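If you want to keep the named back-reference and the per-pattern case sensitivity from the question, one possible approach is to rewrite each pattern's source before joining. The sketch below is an assumption-laden illustration, not a general solution: combine is a hypothetical helper, it renames groups with plain string replacement, it only carries over the IGNORECASE flag, it does not renumber numeric back-references such as \1, and it relies on scoped inline flags like (?i:...), which need Python 3.6 or later.

```python
import re

def combine(*patterns):
    """Concatenate compiled patterns into one new compiled pattern (hypothetical helper)."""
    parts = []
    for index, pat in enumerate(patterns):
        source = pat.pattern
        # Give each named group a per-position suffix so repeated patterns do not collide.
        for name in pat.groupindex:
            unique = f"{name}_{index}"
            source = source.replace(f"(?P<{name}>", f"(?P<{unique}>")
            source = source.replace(f"(?P={name})", f"(?P={unique})")
        # Scope IGNORECASE to this piece only; other flags are ignored in this sketch.
        prefix = "(?i:" if pat.flags & re.IGNORECASE else "(?:"
        parts.append(prefix + source + ")")
    return re.compile("".join(parts))

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
both = combine(first, second, second)

for text in ("...hello one-two-THREEone-two-three...",
             "...hellone/Two/ThreeONE-TWO-THREE...",
             "...HELLO one/Two/ThreeONE-TWO-THREE..."):
    print("Matches" if both.search(text) else "No match")
```

Because the result is a single regex, the engine can backtrack across the pieces, and the hello part stays case-sensitive while each copy of the second pattern does not.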

Upvotes: 1
