Reputation: 57741
I'm trying to write a parser for a string which represents a file path, optionally followed by a colon (:) and a string representing access flags (e.g. r+ or w). The file name can itself contain colons, e.g., foo:bar.txt, so the colon separating the access flags should be the last colon in the string.
Here is my implementation so far:
import re

def parse(string):
    SCHEME = r"file://"                 # File prefix
    PATH_PATTERN = r"(?P<path>.+)"      # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>.+)"    # The letters r, w, a, b, a '+' symbol, or any digit
    # FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$"  # This makes the first test pass, but the second one fail
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + optional(r":" + FLAGS_PATTERN) + r"$"  # This makes the second test pass, but the first one fail
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']

def optional(re):
    '''Encloses the given regular expression in a group which matches 0 or 1 repetitions.'''
    return '({})?'.format(re)
I've tried the following tests:
import pytest

def test_parse_file_with_colon_in_file_name():
    assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")

def test_parse_file_without_acesss_flags():
    assert parse("file://foobar.txt") == ("foobar.txt", None)

if __name__ == "__main__":
    pytest.main([__file__])
The problem is that by either using or not using optional, I can make one or the other test pass, but not both. If I make r":" + FLAGS_PATTERN optional, then the preceding regular expression consumes the entire string.
How can I adapt the parse method to make both tests pass?
Upvotes: 2
Views: 1251
Reputation: 144
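Just for fun, I wrote this parse function without regular expressions, which I think is a better fit here:
def parse(string):
    # Drop everything up to and including the last '//' (the scheme).
    s = string.split('//')[-1]
    try:
        # Split on the last colon only; unpacking fails if there is no colon.
        path, flags = s.rsplit(':', 1)
    except ValueError:
        # No colon at all: the whole string is the path.
        path, flags = s, None
    return path, flags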
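A variant of the same idea, as a rough sketch: str.rpartition splits on the last colon without needing try/except, and stripping only a leading file:// keeps any // that appears later in the path. The name parse_no_regex is made up for this example and is not part of the question's code.

```python
def parse_no_regex(string):
    """Hypothetical non-regex variant; the name is illustrative only."""
    scheme = "file://"
    # Strip only a leading scheme, so a '//' inside the path is preserved.
    s = string[len(scheme):] if string.startswith(scheme) else string
    # rpartition splits on the LAST ':'; sep is '' when no colon is present.
    path, sep, flags = s.rpartition(":")
    if not sep:
        return s, None
    return path, flags
```

Like any approach that splits on the last colon, this treats a path such as foo:bar.txt given without flags as path foo with flags bar.txt.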
Upvotes: 1
Reputation: 627262
You should build the regex like
^file://(?P<path>.+?)(:(?P<flags>[^:]+))?$
See the regex demo.
In your code, the ^ anchor is not necessary, as re.match already anchors the match at the start of the string. The path group matches one or more characters lazily, so it stops at the first position where the rest of the string can be matched as : followed by one or more characters other than : and then the end of the string; that trailing text lands in the flags group. Because of the $ anchor, the path group expands to cover the rest of the string when the optional flags group does not match.
Use the following fix:
PATH_PATTERN = r"(?P<path>.+?)"      # One or more of any character, matched lazily
FLAGS_PATTERN = r"(?P<flags>[^:]+)"  # One or more characters other than a colon
See the online Python demo.
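For reference, here is a minimal sketch of the question's parse function with the two patterns above substituted in. Two details are assumptions added here, not part of the original code: the optional part uses a non-capturing group (?:...) instead of the optional helper, and a ValueError is raised when the string does not match at all.

```python
import re

SCHEME = r"file://"                  # File prefix
PATH_PATTERN = r"(?P<path>.+?)"      # One or more of any character, lazily
FLAGS_PATTERN = r"(?P<flags>[^:]+)"  # One or more characters other than ':'
# Assumption: a non-capturing group replaces the question's optional() helper.
FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r"(?::" + FLAGS_PATTERN + r")?$"


def parse(string):
    match = re.match(FILE_RESOURCE_PATTERN, string)
    if match is None:
        # Assumption: fail loudly instead of raising AttributeError on None.
        raise ValueError("not a file resource: {!r}".format(string))
    # A named group that did not participate in the match is returned as None.
    return match.group("path"), match.group("flags")


if __name__ == "__main__":
    print(parse("file://foo:bar.txt:r+"))
    print(parse("file://foobar.txt"))
```

Both tests from the question should pass with this version. A path that contains a colon but is given without flags is still ambiguous, since its last segment is read as the flags.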
Upvotes: 2