how to write a PLY grammar to parse paths?

I'm trying to write a grammar with PLY that will parse paths in a file. I'm running into shift reduce conflicts and I'm not sure how to change the grammar to fix it. Here's an example of the file I'm trying to parse. The path/filename can be any acceptable linux path.

file : ../../dir/filename.txt
file : filename.txt
file : filename

So here is the grammar that I wrote.

header : ID COLON path

path : pathexpr filename

pathexpr : PERIOD PERIOD DIVIDE pathexpr
           | PERIOD DIVIDE pathexpr
           | ID DIVIDE pathexpr 
           |
filename : ID PERIOD ID
           | ID

Here are my tokens. I am using the PLY included ctokens library. Just to save effort in writing my own.

t_ID = r'[A-Za-z_][A-Za-z0-9_]*'
t_PERIOD = r'\.'
t_DIVIDE = r'/'
t_COLON = r':'

So I believe there is a shift reduce conflict in the "filename" rule because the parser doesn't know whether to reduce the token to "ID" or to shift for "ID PERIOD ID". I think there is another issue with the case of no path ("filename") where it will consume the token in pathexpr instead of reducing to empty.

How can I fix my grammar to handle these cases? Maybe I need to change my tokens?

Upvotes: 1

Answers (3)

Shan

Reputation: 61

I believe this grammar should work and it has an added advantage of being able to recoganize the parts of the path like extension, directory, drive etc. I've not made the parser yet, only this grammar.

fullfilepath : path SLASH filename
path : root
    | root SLASH directories
root : DRIVE
    | PERCENT WIN_DEF_DIR PERCENT
directories : directory
            | directory SLASH directories
directory : VALIDNAME
filename : VALIDNAME
        | VALIDNAME DOT EXTENSION

Upvotes: 0

rici

Reputation: 241931

The simple solution: Use left-recursion instead of right-recursion.

LR parsers (like PLY and yacc) prefer left-recursion, because it avoids having to expand the parser stack. It is also usually closer to the semantics of the expression -- which is useful when you want to actually interpret the language, and not just recognize it -- and it often, as in this case, avoids the need to left-factor.

In this case, for example, each path segment needs to be applied to the preceding pathexpr, by looking for the segment directory inside the currently found directory. The parser action is clear: look up $2 in $1. How do you right the action for the right recursive version?

So, a simple transformation:

header   : ID COLON path

path     : pathexpr filename

pathexpr : pathexpr PERIOD PERIOD DIVIDE
         | pathexpr PERIOD DIVIDE
         | pathexpr ID DIVIDE
         |
filename : ID PERIOD ID
         | ID

Upvotes: 0

PaulMcG

Reputation: 63782

I think you may be using PLY, not pyparsing, looking at those "t_xxx" names. But here is a pyparsing solution to your problem, see below with helpful comments:

"""
header : ID COLON path

path : pathexpr filename

pathexpr : PERIOD PERIOD DIVIDE pathexpr
           | PERIOD DIVIDE pathexpr
           | ID DIVIDE pathexpr 
           |
filename : ID PERIOD ID
           | ID 
"""

from pyparsing import *

ID = Word(alphanums)
PERIOD = Literal('.')
DIVIDE = Literal('/')
COLON = Literal(':') 

# move this to the top, so we can reference it in a negative
# lookahead while parsing the path
file_name = ID + Optional(PERIOD + ID)

# simple path_element - not sufficient, as it will consume 
# trailing ID that should really be part of the filename
path_element = PERIOD+PERIOD | PERIOD | ID

# more complex path_element - adds lookahead to avoid consuming
# filename as a part of the path
path_element = (~(file_name + WordEnd())) + (PERIOD+PERIOD | PERIOD | ID)

# use repetition for these kind of expressions, not recursion
path_expr = path_element + ZeroOrMore(DIVIDE + path_element)

# use Combine so that all the tokens will get returned as a
# contiguous string, not as separate path_elements and slashes
path = Combine(Optional(path_expr + DIVIDE) + file_name)

# define header - note the use of results names, which will allow
# you to access the separate fields by name instead of by position
# (similar to using named groups in regexp's)
header = ID("id") + COLON + path("path")

tests = """\
file: ../../dir/filename.txt
file: filename.txt
file: filename""".splitlines()

for t in tests:
    print t
    print header.parseString(t).dump()
    print

prints

file: ../../dir/filename.txt
['file', ':', '../../dir/filename.txt']
- id: file
- path: ../../dir/filename.txt

file: filename.txt
['file', ':', 'filename.txt']
- id: file
- path: filename.txt

file: filename
['file', ':', 'filename']
- id: file
- path: filename

Upvotes: 0

how to write a PLY grammar to parse paths?

Answers (3)

Related Questions