Menda

Reputation: 1813

Optional string not matched in Regular Expression

The actual string s is longer, but I have shortened it here to keep things simple.

>>> import re
>>> s = "Blah. Tel.: 555 44 33 22."
>>> m = re.search(r"\s*Tel\.:\s*(?P<telephone>.+?)\.", s)
>>> m.group("telephone")
'555 44 33 22'

The code above works, but if I wrap the regex in ()? to make it optional, I don't get any telephone.

>>> m = re.search(r"(\s*Tel\.:\s*(?P<telephone>.+?)\.)?", s)
>>> m
<_sre.SRE_Match object at 0x9369890>
>>> m.group("telephone")

What's the problem here? Thanks!

Edit:

This is part of a larger regular expression in which I'm getting many values from every line of a big file.

regex = r"^(?P<title>.[^(]+);" \
        r"\s*(?P<subtitle>.+)\." \
        r"\s*Tel\.:\s*(?P<telephone>.+?)(\.|;)" \
        r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)" \
        r"(\s*(?P<text>.+?)\.)?" \
        r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"

One sample line could be:

l = "Title title; Subtitle, subtitle. Tel.: 555 33 44 11. www.url.com. coor: 11.11111, -2.222222"

And another sample line:

l = "Title2 title; Subtitle2, subtitle. Tel.: 555 33 44 11. www.url2.com. coor: 44.444444, -6.66666"

It's a really big regex, which is why I didn't post it at first.

Upvotes: 2

Views: 552

Answers (3)

Tim Pietzcker

Reputation: 336108

Your regex is not specific enough about what the title and subtitle groups may match. They gobble up the telephone part, and if that part is optional, the engine simply continues with the next part of the regex (and succeeds). Only when the telephone part is mandatory does the regex engine have to backtrack to find an overall match.

Try

regex = r"^(?P<title>[^;]+);" \
        r"\s*(?P<subtitle>[^.]+)\." \
        r"(\s*Tel\.:\s*(?P<telephone>.+?)(\.|;))?" \
        r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)" \
        r"(\s*(?P<text>.+?)\.)?" \
        r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
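As a rough check, here is a sketch that runs this pattern against the first sample line from the question and against a line with no telephone. The second line is made up for illustration (both sample lines in the question contain a telephone), and the variable names are my own.

```python
import re

# Every fragment is a raw string so that "\s" and "\." reach the regex engine unchanged.
regex = re.compile(
    r"^(?P<title>[^;]+);"
    r"\s*(?P<subtitle>[^.]+)\."
    r"(\s*Tel\.:\s*(?P<telephone>.+?)(\.|;))?"
    r"\s*(?P<url>(www\.|http://).+?\.[a-zA-Z]+)(\.|;)"
    r"(\s*(?P<text>.+?)\.)?"
    r"\s*coor:(\s*(?P<lat>.+?),\s*(?P<long>.+?))?$"
)

# First sample line from the question.
with_tel = "Title title; Subtitle, subtitle. Tel.: 555 33 44 11. www.url.com. coor: 11.11111, -2.222222"
# Hypothetical line without a telephone, invented for this example.
without_tel = "Title2 title; Subtitle2, subtitle. www.url2.com. coor: 44.444444, -6.66666"

for line in (with_tel, without_tel):
    m = regex.search(line)
    # telephone is None when the optional group did not take part in the match.
    print(m.group("title"), "|", m.group("telephone"), "|", m.group("url"))
```

Because the title and subtitle groups can no longer run past a ";" or ".", the optional telephone group is tried at the right position instead of being swallowed by an earlier group.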

Upvotes: 0

eumiro

Reputation: 212835

(anything)? matches the empty (zero-length) string at the very beginning of your string (before Blah), so the search succeeds right there and does not bother looking any further.
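One way to see this for yourself is to look at the span of the match. A minimal sketch using the string from the question (the names optional and required are just for illustration):

```python
import re

s = "Blah. Tel.: 555 44 33 22."

# The whole pattern is optional, so it can succeed by matching nothing at position 0.
optional = re.search(r"(\s*Tel\.:\s*(?P<telephone>.+?)\.)?", s)
print(optional.span())              # (0, 0)
print(repr(optional.group(0)))      # ''
print(optional.group("telephone"))  # None

# Without the ()? wrapper the engine has to keep scanning until it finds "Tel.:".
required = re.search(r"\s*Tel\.:\s*(?P<telephone>.+?)\.", s)
print(required.group("telephone"))  # 555 44 33 22
```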

EDIT:

If you have many lines and only some of them contain the wanted string, try the following:

import re

rex = re.compile(r"\s*Tel\.:\s*(?P<telephone>.+?)\.")
for line in lines:
    m = rex.search(line)
    if m:
        print(m.group("telephone"))

Upvotes: 2

NPE

Reputation: 500227

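This is because an empty string is a valid match for your regular expression. re.search returns the first match it finds scanning from the left, and the empty match at the very start of the string comes before the longer match further along.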

You might want to take a look at re.findall.
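For instance, a small sketch with re.findall on the shortened string from the question. The simplified pattern and the second input string are my own, for illustration only:

```python
import re

s = "Blah. Tel.: 555 44 33 22."

# With exactly one capturing group, findall returns a list of that group's text.
phones = re.findall(r"Tel\.:\s*(.+?)\.", s)
print(phones)  # ['555 44 33 22']

# When nothing matches you simply get an empty list, so no optional wrapper is needed.
print(re.findall(r"Tel\.:\s*(.+?)\.", "No phone here."))  # []
```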

edit: You can move the optionality out of your regular expression altogether:

import re
s = "Blah. Tel.: 555 44 33 22."
m = re.search(r"\s*Tel\.:\s*(?P<telephone>.+?)\.", s)
if m is not None:
    print(m.group("telephone"))

Upvotes: 2
