Kurt Peek

Reputation: 57741

How to make a regular expression 'greedy but optional'

I'm trying to write a parser for a string which represents a file path, optionally followed by a colon (:) and a string representing access flags (e.g. r+ or w). The file name can itself contain colons, e.g., foo:bar.txt, so the colon separating the access flags should be the last colon in the string.

Here is my implementation so far:

import re

def parse(string):
    SCHEME = r"file://"                             # File prefix
    PATH_PATTERN = r"(?P<path>.+)"                  # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>.+)"        # One or more of any character (intended: the letters r, w, a, b, a '+' symbol, or any digit)

    # FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$"               # This makes the first test pass, but the second one fail
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + optional(r":" + FLAGS_PATTERN) + r"$"   # This makes the second test pass, but the first one fail

    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()

    return tokens['path'], tokens['flags']

def optional(re):
    '''Encloses the given regular expression in a group which matches 0 or 1 repetitions.'''
    return '({})?'.format(re)

I've tried the following tests:

import pytest

def test_parse_file_with_colon_in_file_name():
    assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")

def test_parse_file_without_access_flags():
    assert parse("file://foobar.txt") == ("foobar.txt", None)

if __name__ == "__main__":
    pytest.main([__file__])

The problem is that, by either using or not using optional, I can make one or the other test pass, but not both. If I make r":" + FLAGS_PATTERN optional, then the preceding path group, being greedy, consumes the entire string and the flags group never matches.

How can I adapt the parse method to make both tests pass?

Upvotes: 2

Views: 1251

Answers (2)

A.Sherif

Reputation: 144

Just for fun, I wrote this parse function, which I think is simpler than using a regular expression:

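def parse(string):
    # Drop the scheme: keep everything after the first '//'
    s = string.split('//', 1)[-1]
    try:
        # Split on the last colon only, so colons in the file name survive
        path, flags = s.rsplit(':', 1)
    except ValueError:
        # No colon at all, so there are no access flags
        path, flags = s, None
    return path, flags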
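
The same idea can also be sketched with str.rpartition, which splits on the last colon without needing exception handling. This is only an illustrative variant: the name parse_rpartition, the scheme parameter, and the explicit prefix check are assumptions, not part of the answer above.

```python
def parse_rpartition(string, scheme="file://"):
    # Strip the scheme prefix only if it is actually present
    s = string[len(scheme):] if string.startswith(scheme) else string
    # rpartition splits on the LAST colon; when no colon is found it
    # returns ('', '', s), so an empty separator means "no flags"
    path, sep, flags = s.rpartition(":")
    if not sep:
        return s, None
    return path, flags


print(parse_rpartition("file://foo:bar.txt:r+"))
print(parse_rpartition("file://foobar.txt"))
```

Note that, like the rsplit version, this treats a trailing colon as an empty flags string rather than as part of the path.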

Upvotes: 1

Wiktor Stribiżew

Reputation: 627262

You should build the regex like this:

^file://(?P<path>.+?)(:(?P<flags>[^:]+))?$

See the regex demo.

In your code, the ^ anchor is not necessary, because re.match already anchors the match at the start of the string. The path group matches one or more characters lazily, so it takes as little text as possible and leaves everything that Group 2 can match to the flags capture. It therefore stops at the first : that is followed by one or more characters other than : and then the end of the string, which in practice is the last colon. If there is no such colon, the optional group is skipped, and the $ anchor forces the path group to expand to the end of the string.
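
To make the difference concrete, here is a small comparison of the greedy and lazy path groups on the string from the question. The constant names GREEDY and LAZY are chosen here for illustration only.

```python
import re

# Same pattern twice; only the quantifier on the path group differs
GREEDY = r"file://(?P<path>.+)(:(?P<flags>[^:]+))?$"
LAZY = r"file://(?P<path>.+?)(:(?P<flags>[^:]+))?$"

s = "file://foo:bar.txt:r+"
greedy = re.match(GREEDY, s).groupdict()
lazy = re.match(LAZY, s).groupdict()

# Greedy: the path swallows everything and the optional group matches nothing
print(greedy)  # {'path': 'foo:bar.txt:r+', 'flags': None}
# Lazy: the path grows only until the optional group plus $ can succeed
print(lazy)    # {'path': 'foo:bar.txt', 'flags': 'r+'}
```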

Use the following fix:

PATH_PATTERN = r"(?P<path>.+?)"                  # One or more of any character, as few as possible (lazy)
FLAGS_PATTERN = r"(?P<flags>[^:]+)"        # One or more characters other than a colon

See the online Python demo.
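
Put together, the fix might look like the following self-contained sketch. It uses a non-capturing group for the optional part and raises ValueError on input that does not match; both of those choices are assumptions added here, not part of the original code, which would fail with an AttributeError on a non-matching string.

```python
import re

FILE_RESOURCE_PATTERN = re.compile(
    r"file://"                  # File prefix
    r"(?P<path>.+?)"            # One or more of any character, lazily
    r"(?::(?P<flags>[^:]+))?"   # Optionally: a colon, then 1+ non-colon characters
    r"$"
)


def parse(string):
    match = FILE_RESOURCE_PATTERN.match(string)
    if match is None:
        # Assumption: reject anything that is not a file:// resource
        raise ValueError(f"not a file resource: {string!r}")
    return match.group("path"), match.group("flags")


assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
assert parse("file://foobar.txt") == ("foobar.txt", None)
```

Both tests from the question pass with this version.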

Upvotes: 2
