jmq

Reputation: 1591

Python regular expression doesn't match group when ORed with other expressions

I've been trying to debug a strange problem with regular expressions, and I've reduced it to the simple case below. I'm checking a string against any of four regular expressions. My string matches, but the group in parentheses that I'm trying to extract doesn't come back with what I have coded up. What I can't figure out is this: if I search for just one expression, both the entire matched string and the value I want to extract are correct. However, when I search using all four expressions, the string still matches [group(0)] but I don't get the field that I need [group(1)].

#!/usr/bin/python3

import re

data = '<w:t xml:space="preserve">More </w:t>'

text = re.search("<w:p>|<w:p .*?>|<w:t>(.*?)</w:t>|<w:t .*?>(.*?)</w:t>", data)
print("First RE")
print("group(0) " + text.group(0))
try:
    print("group(1) " + text.group(1))
except:
    pass

print("Second RE")
text = re.search("<w:t .*?>(.*?)</w:t>", data)
print("group(0) " + text.group(0))
try:
    print("group(1) " + text.group(1))
except:
    pass

When I run it I get this result:

First RE
group(0) <w:t xml:space="preserve">More </w:t>
Second RE
group(0) <w:t xml:space="preserve">More </w:t>
group(1) More 

I would expect both regular expressions to return the same results. Could someone explain why they don't? According to the documentation the OR "|" has a low precedence, so I'm not sure why, or whether, the other regular expressions are affecting it. Thanks!

Upvotes: 0

Views: 1591

Answers (1)

Chris Doyle

Reputation: 11992

Your first regex has two capture groups in it and your second regex only has one. Capture groups are numbered by the position of their opening parenthesis in the whole pattern, regardless of which alternative they sit in. Because your first regex uses alternation (|), the alternative containing the first capture group, <w:t>(.*?)</w:t>, doesn't match your data, so that group is left unset. The last alternative, <w:t .*?>(.*?)</w:t>, does match, so the value is stored in your second capture group.

So after the first regex runs, the first capture group is None and the second is populated.

import re

data = '<w:t xml:space="preserve">More </w:t>'
text = re.search("<w:p>|<w:p .*?>|<w:t>(.*?)</w:t>|<w:t .*?>(.*?)</w:t>", data)
print("First RE")
print(text.groups())
print("Second RE")
text = re.search("<w:t .*?>(.*?)</w:t>", data)
print(text.groups())

OUTPUT

First RE
(None, 'More ')
Second RE
('More ',)

So your issue is that you're only looking at the first capture group, but in your first regex that capture group is None. Inside the try block you're trying to concatenate "group(1) " with the value from the first capture group. However, you can only concatenate two strings, and the value of the first capture group is None, so this raises TypeError: can only concatenate str (not "NoneType") to str, which your bare except then catches and ignores.

That's why you don't see the print.
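
If you want the captured text no matter which alternative matched, here is a minimal sketch of two ways to get it. The names captured_text, pattern and merged are just illustrative, not anything from your code. The first option keeps your pattern and uses Match.lastindex to pick the group that actually took part in the match; the second assumes it's acceptable to merge the two <w:t> alternatives into one with an optional non-capturing group, so there is only ever group(1).

```python
import re

data = '<w:t xml:space="preserve">More </w:t>'

# Option 1: keep the original four-way pattern (two capture groups).
pattern = re.compile(r"<w:p>|<w:p .*?>|<w:t>(.*?)</w:t>|<w:t .*?>(.*?)</w:t>")


def captured_text(match):
    """Return the text of the capture group that took part in the match.

    Returns None if there was no match, or if the alternative that matched
    has no capture group (e.g. a <w:p> tag).
    """
    if match is None or match.lastindex is None:
        return None
    return match.group(match.lastindex)


# Option 2: fold both <w:t> alternatives into one, so only group(1) exists.
# (?: .*?)? optionally matches a space followed by attributes, without capturing.
merged = re.compile(r"<w:p>|<w:p .*?>|<w:t(?: .*?)?>(.*?)</w:t>")

print(repr(captured_text(pattern.search(data))))     # 'More '
print(repr(merged.search(data).group(1)))            # 'More '
print(repr(captured_text(pattern.search("<w:p>"))))  # None
```

Either way, it's worth replacing the bare except: pass with an explicit check for None, since that is what hid the TypeError in the first place.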

Upvotes: 2
