Reputation: 1210
I imagine that this question is basic enough that an answer must already exist, but my google-fu skills must be lacking.
I need to parse strings with the following format: upper:lower cc ; ! comment. The character % is used to escape the special characters %:; ! (the space included). The : character delimits upper from lower. The ; character terminates a line. The space character is used to delimit the cc element. Comments are introduced using !. The following strings should be parsed as shown:
a:b c ;            upper="a"   lower="b"  cc="c"  comment=""
a%::b c ;          upper="a:"  lower="b"  cc="c"  comment=""
a%%:b c ; ! x      upper="a%"  lower="b"  cc="c"  comment=" x"
a%!:b c ; ! x      upper="a!"  lower="b"  cc="c"  comment=" x"
a%%%::b c ;        upper="a%:" lower="b"  cc="c"  comment=""
What is the most pythonic (i.e. simple, readable, elegant) and robust way to approach this task in python? Are regular expressions suitable?
I tried writing a regular expression that used a negative lookbehind to detect an odd number of %s before the :, but apparently lookbehinds cannot be of variable length.
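For reference, Python's standard re module does reject variable-width lookbehinds when the pattern is compiled. The sketch below shows that failure and one lookbehind-free alternative that consumes each %x pair as a single unit, so any : reached outside a pair must be unescaped. The helper names are hypothetical and only cover splitting upper from the rest, not the full format.

```python
import re

def rejects_variable_lookbehind():
    """Return True if re refuses to compile a variable-width lookbehind."""
    try:
        re.compile(r"(?<!%+):")
    except re.error:
        return True
    return False

# Treat every "%x" pair as one unit, so a ':' met outside a pair is unescaped.
UNESCAPED_COLON = re.compile(r"((?:%.|[^%:])*):")

def split_at_unescaped_colon(text):
    """Split text at its first unescaped ':'; return None if there is none."""
    match = UNESCAPED_COLON.match(text)
    if match is None:
        return None
    # The first element is still escaped; unescaping is a separate step.
    return match.group(1), text[match.end():]

print(rejects_variable_lookbehind())
print(split_at_unescaped_colon("a%%%::b c ;"))
```

Counting the % characters is never needed here: pairing them off from the left decides whether the : is escaped.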
Upvotes: 2
Views: 99
Reputation: 1210
Based on the comment by @MichaelButscher, I wrote the following solution using regular expressions:
import re

def parse_line(line):
    parsed = re.match(r'''( (?: %. | [^:] )+ )     # capture upper
                          (?: :                    # colon delimiter
                            ( (?: %. | [^ ] )+ )   # capture lower
                          )?                       # :lower is optional
                          \ +                      # space delimiter(s)
                          ( (?: %. | [^ ;] )+ )    # capture cont class
                          \ +;                     # space delimiter(s), then terminator
                          ( .* ) \s* $             # capture comment''',
                      line, re.X)
    if parsed is None:  # malformed line: the pattern did not match
        return None
    groups = parsed.groups(default='')
    groups = [re.sub('%(.)', r'\1', elem) for elem in groups]  # unescape
    return groups
This yields the following results:
>>> print(parse_line("a:b c ;"))
['a', 'b', 'c', '']
>>> print(parse_line("a%::b c ;"))
['a:', 'b', 'c', '']
>>> print(parse_line("a%%:b c ; ! x"))
['a%', 'b', 'c', ' ! x']
>>> print(parse_line("a%!:b c ; ! x"))
['a!', 'b', 'c', ' ! x']
For malformed entries, re.match returns None, so that case must be handled before calling .groups().
Upvotes: 0
Reputation: 82889
Similar to the answer from AKX, but I already had this ready when I saw it. Also, the approach is a bit different (easier to adapt to a different format) and the result might be slightly cleaner, too.
def parse(line):
    parts = [""]
    # Expected delimiters, in order: ':', ' ', ' ', ';', ' ', '!'.
    # The two spaces after ':' are needed to reproduce the output below.
    delims = ":  ; !"
    escape = False
    for c in line:
        if escape:
            parts[-1] += c
            escape = False
        elif c == "%":
            escape = True
        elif c == delims[:1]:
            parts += [""]
            delims = delims[1:]
        else:
            parts[-1] += c
    # No ';' consumed means the line was never terminated
    return [p for p in parts if p] if ";" not in delims else None

lines = ["a:b c ;", "a%::b c ;", "a%%:b c ; ! x", "a%!:b c ; ! x", "a%%%::b c ;", "a:b incomplete"]
for line in lines:
    print(line, "\t", parse(line))
Basically, this iterates over the line character by character, keeps track of "escape mode", and compares the current character against the next expected delimiter.
Output:
a:b c ; ['a', 'b', 'c']
a%::b c ; ['a:', 'b', 'c']
a%%:b c ; ! x ['a%', 'b', 'c', ' x']
a%!:b c ; ! x ['a!', 'b', 'c', ' x']
a%%%::b c ; ['a%:', 'b', 'c']
a:b incomplete None
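The same character-by-character idea can also produce the named fields asked for in the question. The following is only a rough sketch, assuming that unescaped spaces around cc are padding and that the comment is whatever follows the first ! after the ;. The name parse_fields is hypothetical and not part of the answer above.

```python
def parse_fields(line):
    """Parse 'upper:lower cc ; ! comment' into a dict, or None if no ';' ends the line."""
    upper, lower, cc = [], [], []
    targets = [upper, lower, cc]
    stage = 0        # index of the field currently being filled
    escape = False
    for pos, ch in enumerate(line):
        if escape:                       # escaped character: always literal
            targets[stage].append(ch)
            escape = False
        elif ch == "%":                  # start of an escape pair
            escape = True
        elif stage == 0 and ch == ":":   # upper -> lower
            stage = 1
        elif stage == 1 and ch == " ":   # lower -> cc
            stage = 2
        elif stage == 2 and ch == " ":   # assumption: padding around cc is dropped
            continue
        elif ch == ";":                  # unescaped terminator
            # Assumption: the comment is the text after the first '!'.
            comment = line[pos + 1:].partition("!")[2]
            return {
                "upper": "".join(upper),
                "lower": "".join(lower),
                "cc": "".join(cc),
                "comment": comment,
            }
        else:
            targets[stage].append(ch)
    return None                          # no terminating ';' was found

print(parse_fields("a%%:b c ; ! x"))
```

Collecting characters in lists and joining once at the end avoids repeated string concatenation, which matters little for short lines but keeps the loop body uniform.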
Upvotes: 1
Reputation: 168824
I don't think regexps can reliably capture the escaping state. Here's a state-machine style parser.
def parse_line(s):
    fields = [""]
    in_escape = False
    for i, c in enumerate(s):
        if not in_escape:
            if c == "%":  # Start of escape
                in_escape = True
                continue
            if (len(fields) == 1 and c == ":") or (len(fields) == 2 and c == " "):  # Next field
                fields.append("")
                continue
            if c == ";":  # End-of-line
                break
        fields[-1] += c  # Regular or escaped character
        in_escape = False
    return (fields, s[i + 1:])

print(parse_line("a:b c ;"))
print(parse_line("a%::b c ;"))
print(parse_line("a%%:b c ; ! x"))
print(parse_line("a%!:b c ; ! x"))
print(parse_line("a%%%::b c defgh:!:heh;"))
print(parse_line("a%;"))
print(parse_line("a%;:b!unterminated-line"))
outputs
(['a', 'b', 'c '], '')
(['a:', 'b', 'c '], '')
(['a%', 'b', 'c '], ' ! x')
(['a!', 'b', 'c '], ' ! x')
(['a%:', 'b', 'c defgh:!:heh'], '')
(['a;'], '')
(['a;', 'b!unterminated-line'], '')
i.e. the return value is a 2-tuple of the parsed fields and the rest of the line after the ; marker (which may or may not contain a comment).
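If named fields are wanted, the 2-tuple can be post-processed. The following is a small sketch under two assumptions that the answer itself does not state: trailing spaces in the third field are padding, and the comment is everything after the first ! in the remainder. The name to_record is hypothetical.

```python
def to_record(fields, rest):
    """Turn a (fields, rest) tuple into a dict with named keys."""
    names = ("upper", "lower", "cc")
    # Pad with empty strings when fewer than three fields were parsed.
    padded = list(fields) + [""] * (len(names) - len(fields))
    record = dict(zip(names, padded))
    # Assumption: trailing spaces in cc are padding before ';', not data.
    record["cc"] = record["cc"].rstrip(" ")
    # Assumption: the comment is everything after the first '!' in rest.
    record["comment"] = rest.partition("!")[2]
    return record

print(to_record(["a%", "b", "c "], " ! x"))
```

Note that rstrip would also remove a trailing space that had been escaped in the input, since the escaping information is already gone at this point.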
Upvotes: 3