Reputation: 1813
The string s is longer in practice, but I have shortened it here to keep the example simple.
>>> import re
>>> s = "Blah. Tel.: 555 44 33 22."
>>> m = re.search(r"\s*Tel\.:\s*(?P<telephone>.+?)\.", s)
>>> m.group("telephone")
'555 44 33 22'
The code above works, but if I wrap the regex in ()? to make it optional, I no longer get the telephone.
>>> m = re.search(r"(\s*Tel\.:\s*(?P<telephone>.+?)\.)?", s)
>>> m
<_sre.SRE_Match object at 0x9369890>
>>> m.group("telephone")
What's the problem here? Thanks!
Edit:
This is part of a larger regular expression in which I'm getting many values from every line of a big file.
regex = r"^(?P<title>.[^(]+);" \
        r"\s*(?P<subtitle>.+)\." \
        r"\s*Tel\.:\s*(?P<telephone>.+?)(\.|;)" \
        r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)" \
        r"(\s*(?P<text>.+?)\.)?" \
        r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
One sample line could be:
l = "Title title; Subtitle, subtitle. Tel.: 555 33 44 11. www.url.com. coor: 11.11111, -2.222222"
And another sample line:
l = "Title2 title; Subtitle2, subtitle. Tel.: 555 33 44 11. www.url2.com. coor: 44.444444, -6.66666"
It's a really big regex, so that's why I didn't post it.
Upvotes: 2
Views: 552
Reputation: 336108
Your regex is not specific enough about what the title and subtitle parts may match. They gobble up the telephone part, and if that part is optional, the engine simply carries on with the next part of the regex (and succeeds). Only when the telephone part is mandatory does the regex engine have to backtrack far enough to find an overall match that includes it.
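A small sketch of that effect, using the question's original pattern with only the Tel. part wrapped in an optional group. The name loose is just an illustrative label, and the line is the first sample from the question:

```python
import re

# The question's original pattern; the only change is that the
# Tel. part is wrapped in (...)? to make it optional.
loose = (
    r"^(?P<title>.[^(]+);"
    r"\s*(?P<subtitle>.+)\."
    r"(\s*Tel\.:\s*(?P<telephone>.+?)(\.|;))?"
    r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)"
    r"(\s*(?P<text>.+?)\.)?"
    r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
)

line = ("Title title; Subtitle, subtitle. Tel.: 555 33 44 11. "
        "www.url.com. coor: 11.11111, -2.222222")

m = re.match(loose, line)
# The greedy subtitle group has swallowed the telephone text,
# so the optional group matched nothing.
print(repr(m.group("subtitle")))  # 'Subtitle, subtitle. Tel.: 555 33 44 11'
print(m.group("telephone"))       # None
```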
Try
regex = r"^(?P<title>[^;]+);" \
        r"\s*(?P<subtitle>[^.]+)\." \
        r"(\s*Tel\.:\s*(?P<telephone>.+?)(\.|;))?" \
        r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)" \
        r"(\s*(?P<text>.+?)\.)?" \
        r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
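As a quick check, the tightened pattern can be run against the first sample line from the question and against a line with no telephone. The second line (no_tel) is made up for illustration and is not from the original data:

```python
import re

regex = (
    r"^(?P<title>[^;]+);"
    r"\s*(?P<subtitle>[^.]+)\."
    r"(\s*Tel\.:\s*(?P<telephone>.+?)(\.|;))?"
    r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)"
    r"(\s*(?P<text>.+?)\.)?"
    r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
)

with_tel = ("Title title; Subtitle, subtitle. Tel.: 555 33 44 11. "
            "www.url.com. coor: 11.11111, -2.222222")
# Hypothetical line without a telephone, invented for this example.
no_tel = "Title3 title; Subtitle3, subtitle. www.url3.com. coor: 1.5, -2.5"

m1 = re.match(regex, with_tel)
m2 = re.match(regex, no_tel)

print(m1.group("subtitle"), "|", m1.group("telephone"))
print(m2.group("subtitle"), "|", m2.group("telephone"))
```

Because subtitle can no longer cross a period, the optional telephone group is tried at the right position and is simply left unset (None) when the line has no Tel. part.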
Upvotes: 0
Reputation: 212835
(anything)? matches the empty string at the very beginning of your string (before Blah), so the search is satisfied there and does not look any further.
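This can be seen by inspecting the match object from the question's shortened example; the variable name pattern is only for illustration:

```python
import re

s = "Blah. Tel.: 555 44 33 22."
pattern = r"(\s*Tel\.:\s*(?P<telephone>.+?)\.)?"

m = re.search(pattern, s)
print(m.span())              # (0, 0): a zero-length match at the start
print(repr(m.group(0)))      # ''
print(m.group("telephone"))  # None: the optional group did not participate
```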
EDIT:
If you have many lines and only some of them contain the wanted string, try the following:
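import re
rex = re.compile(r"\s*Tel\.:\s*(?P<telephone>.+?)\.")
# Sample input; in practice, iterate over the lines of your file.
lines = ["Blah. Tel.: 555 44 33 22.", "No telephone here."]
for line in lines:
    m = rex.search(line)
    if m:  # search() returns None when the line has no telephone
        print(m.group("telephone"))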
Upvotes: 2
Reputation: 500227
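This is because an empty string is a valid match for your regular expression, and since re.search returns the first match it finds when scanning from the left, the empty match at position 0 wins over the longer match further along.
You might want to take a look at re.findall.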
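A minimal sketch of the re.findall suggestion, assuming the mandatory (non-optional) pattern is used so that only real telephone matches are returned. The pattern is simplified to a single unnamed group, which makes findall return the captured strings directly:

```python
import re

s = "Blah. Tel.: 555 44 33 22."

# With exactly one capturing group, findall returns a list of that
# group's text, one entry per match; no match gives an empty list.
found = re.findall(r"Tel\.:\s*(.+?)\.", s)
missing = re.findall(r"Tel\.:\s*(.+?)\.", "Blah. No phone here.")

print(found)    # ['555 44 33 22']
print(missing)  # []
```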
edit: You can move the optionality out of your regular expression altogether:
import re
s = "Blah. Tel.: 555 44 33 22."
m = re.search(r"\s*Tel\.:\s*(?P<telephone>.+?)\.", s)
if m is not None:  # no match means the string has no telephone
    print(m.group("telephone"))
Upvotes: 2