Reputation: 2994
I need to parse a (relatively) simple, line-oriented language (I didn't invent the language itself; it is the definition language for PlantUML graphs).
My test input is quite simple:
@startuml
Alice -> Bob: Authentication Request
Bob --> Alice: Authentication Response
Alice -> Bob: Another authentication Request
Alice <-- Bob: another authentication Response
@enduml
The problem arises because whatever follows the colon (':') should be treated as a (possibly escaped) string up to the first EOL ('\n'), completely disregarding any internal punctuation.
NOTE: what follows is just an excerpt of the grammar for the sake of simplicity; I have no problem posting the full test program if deemed useful.
tokens = (
    'BEGIN', 'END', 'START', 'STATE', 'RARROW2', 'RARROW1', 'LARROW2', 'LARROW1',
    'IDENT', 'COLON', 'NUMBER', 'BSCRIPT', 'ESCRIPT', 'ENDLINE', 'FULLINE', 'newline'
)

literals = '{:}'

t_BEGIN = r"@startuml"
t_END = r"@enduml"
t_START = r"\[\*\]"
t_RARROW2 = r"-->"
t_RARROW1 = r"->"
t_LARROW2 = r"<--"
t_LARROW1 = r"<-"
t_BSCRIPT = r"/'--"
t_ESCRIPT = r"--'/"
t_ENDLINE = r'.+'
t_FULLINE = r'^.*\n'

def t_IDENT(t):
    r"""[a-zA-Z_][a-zA-Z0-9_]*"""
    return t

t_ignore = " \t"

def t_newline(t):
    r"""\n+"""
    t.lexer.lineno += t.value.count("\n")
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

def p_diagram(p):
    """diagram : begin diags end"""

def p_begin(p):
    """begin : BEGIN newline"""

def p_end(p):
    """end : END newline"""

def p_diags1(p):
    """diags : diag"""

def p_diags2(p):
    """diags : diags diag"""

def p_diag_t(p):
    """diag : tranc"""

def p_tranc1(p):
    """tranc : trans newline"""

def p_tranc2(p):
    """tranc : trans ':' ENDLINE newline"""

def p_transr(p):
    """trans : node rarrow node"""

def p_transl(p):
    """trans : node larrow node"""

def p_node(p):
    """node : IDENT
            | START"""

def p_rarrow(p):
    """rarrow : RARROW1
              | RARROW2"""
    p[0] = p[1]
    print("rarrow : (%s)" % p[1])

def p_larrow(p):
    """larrow : LARROW1
              | LARROW2"""
The first error I get is: Syntax error at ': Authentication Request'
The parser debug output is:
yacc.py: 360:PLY: PARSE DEBUG START
yacc.py: 408:
yacc.py: 409:State : 0
yacc.py: 433:Stack : . LexToken(BEGIN,'@startuml',1,0)
yacc.py: 443:Action : Shift and goto state 2
yacc.py: 408:
yacc.py: 409:State : 2
yacc.py: 433:Stack : BEGIN . LexToken(newline,'\n',1,9)
yacc.py: 443:Action : Shift and goto state 11
yacc.py: 408:
yacc.py: 409:State : 11
yacc.py: 433:Stack : BEGIN newline . LexToken(IDENT,'Alice',2,10)
yacc.py: 469:Action : Reduce rule [begin -> BEGIN newline] with ['@startuml','\n'] and goto state 1
yacc.py: 504:Result : <NoneType @ 0x5584868800e0> (None)
yacc.py: 408:
yacc.py: 409:State : 1
yacc.py: 433:Stack : begin . LexToken(IDENT,'Alice',2,10)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin IDENT . LexToken(RARROW1,'->',2,16)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Alice'] and goto state 10
yacc.py: 504:Result : <Node @ 0x7fa389dae9e8> ([[Alice]])
yacc.py: 408:
yacc.py: 409:State : 10
yacc.py: 433:Stack : begin node . LexToken(RARROW1,'->',2,16)
yacc.py: 443:Action : Shift and goto state 20
yacc.py: 408:
yacc.py: 409:State : 20
yacc.py: 433:Stack : begin node RARROW1 . LexToken(IDENT,'Bob',2,19)
yacc.py: 469:Action : Reduce rule [rarrow -> RARROW1] with ['->'] and goto state 22
yacc.py: 504:Result : <str @ 0x7fa389daea78> ('->')
yacc.py: 408:
yacc.py: 409:State : 22
yacc.py: 433:Stack : begin node rarrow . LexToken(IDENT,'Bob',2,19)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin node rarrow IDENT . LexToken(ENDLINE,': Authentication Request',2,22)
yacc.py: 578:Error : begin node rarrow IDENT . LexToken(ENDLINE,': Authentication Request',2,22)
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin node rarrow IDENT . LexToken(newline,'\n',2,46)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Bob'] and goto state 26
yacc.py: 504:Result : <Node @ 0x7fa389daeb00> ([[Bob]])
yacc.py: 408:
yacc.py: 409:State : 26
yacc.py: 433:Stack : begin node rarrow node . LexToken(newline,'\n',2,46)
yacc.py: 469:Action : Reduce rule [trans -> node rarrow node] with [[[Alice]],'->',[[Bob]]] and goto state 9
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 9
yacc.py: 433:Stack : begin trans . LexToken(newline,'\n',2,46)
yacc.py: 443:Action : Shift and goto state 16
yacc.py: 408:
yacc.py: 409:State : 16
yacc.py: 433:Stack : begin trans newline . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [tranc -> trans newline] with [<Trans @ 0x7fa389daea58>,'\n'] and goto state 4
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 4
yacc.py: 433:Stack : begin tranc . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [diag -> tranc] with [<Trans @ 0x7fa389daea58>] and goto state 5
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 5
yacc.py: 433:Stack : begin diag . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [diags -> diag] with [<Trans @ 0x7fa389daea58>] and goto state 6
yacc.py: 504:Result : <list @ 0x7fa389db3ac8> ([[[Alice]] --> [[Bob]]])
yacc.py: 408:
yacc.py: 409:State : 6
yacc.py: 433:Stack : begin diags . LexToken(IDENT,'Bob',3,47)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags IDENT . LexToken(RARROW2,'-->',3,51)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Bob'] and goto state 10
yacc.py: 504:Result : <Node @ 0x7fa389daeb00> ([[Bob]])
yacc.py: 408:
yacc.py: 409:State : 10
yacc.py: 433:Stack : begin diags node . LexToken(RARROW2,'-->',3,51)
yacc.py: 443:Action : Shift and goto state 21
yacc.py: 408:
yacc.py: 409:State : 21
yacc.py: 433:Stack : begin diags node RARROW2 . LexToken(IDENT,'Alice',3,55)
yacc.py: 469:Action : Reduce rule [rarrow -> RARROW2] with ['-->'] and goto state 22
yacc.py: 504:Result : <str @ 0x7fa389daeb90> ('-->')
yacc.py: 408:
yacc.py: 409:State : 22
yacc.py: 433:Stack : begin diags node rarrow . LexToken(IDENT,'Alice',3,55)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags node rarrow IDENT . LexToken(ENDLINE,': Authentication Response',3,60)
yacc.py: 578:Error : begin diags node rarrow IDENT . LexToken(ENDLINE,': Authentication Response',3,60)
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags node rarrow IDENT . LexToken(newline,'\n',3,85)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Alice'] and goto state 26
yacc.py: 504:Result : <Node @ 0x7fa389dae9e8> ([[Alice]])
yacc.py: 408:
As you can see, the token following the second IDENT ('Bob') is ENDLINE(': Authentication Request'), which includes the colon as its first character and thus throws the parser completely off track.
What is the recommended fix for this?
Upvotes: 0
Views: 575
Reputation: 241671
That this lexer works even a little bit is a consequence of the peculiar order in which Ply applies lexical rules. [Note 1]
Lexical analysis is simplest when you can analyse the input into a sequence of lexemes where a lexeme can be identified without any consideration of previous lexemes. That's the default model for pretty well any tokeniser framework. In that model, a lexical pattern defined as "anything up to the end of the line" is always going to be applicable, which means that your input would be analysed into newlines and rest-of-lines. That's probably not what you wanted.
It seems like the lexeme is actually "a colon followed by the rest of the line", so there is no point separating the colon and the rest of the line into two tokens. If that's the case, then the pattern is really easy to write: r':.*'. (If colons are used somewhere else for some other purpose, this won't work. Hopefully, they aren't.)
If you separated the colon and the rest of the line into two tokens only so that the colon would not be part of the matched token's value, then you can achieve the same effect by modifying t.value inside the :.* token function.
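A minimal sketch of that second variant, assuming the token keeps the name ENDLINE from the question and that the colon is not needed anywhere else:

def t_ENDLINE(t):
    r""":.*"""
    # Drop the leading ':' and any surrounding whitespace so the parser
    # only sees the message text.
    t.value = t.value[1:].strip()
    return t

With a rule like this, the grammar production would no longer need the separate ':' literal, e.g. tranc : trans ENDLINE newline instead of tranc : trans ':' ENDLINE newline.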
Note 1: Ply checks patterns in the following order: first, all of the tokens defined as functions, in the order in which they appear in the file; then the tokens defined as string variables, sorted by decreasing length of the pattern string; and finally the literal characters, which are only consulted if no pattern matched at all. Since the pattern .+ is longer than the pattern :, it will be tried first, and thus the colon will never be recognised. It was, I believe, pure luck that -> was matched before .+. The ordering of patterns by length should not be relied on for patterns with the same length.
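To make the tie concrete: Ply's length-based sort looks only at the raw pattern strings, so for the variables in the excerpt above:

len(r'-->')   # 3 -- RARROW2: longer, reliably tried before the two below
len(r'->')    # 2 -- RARROW1
len(r'.+')    # 2 -- ENDLINE: same length as RARROW1, so their relative order is not guaranteed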
On the whole, it is better to use one of the following strategies:
Only use token functions, and order them manually in the correct order (as sketched below).
Use token variables only for patterns which are unambiguous.
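Here is a sketch of the first strategy using just the arrow tokens from the question: once they are defined as functions, Ply tries them in the order they appear in the file (and all function-defined tokens are tried before any string-defined ones), so the longer patterns simply need to come first.

# Function rules are matched in definition order: longer arrows first.
def t_RARROW2(t):
    r"""-->"""
    return t

def t_RARROW1(t):
    r"""->"""
    return t

def t_LARROW2(t):
    r"""<--"""
    return t

def t_LARROW1(t):
    r"""<-"""
    return t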
Upvotes: 1