I'm trying to write a parser using ply in which the \n is sometimes syntactically important, and sometimes has to be ignored. More precisely in the language I would like to parse there are some lines corresponding to definitions that must end with a \n to indicate the end of the definition. In all other cases, the \n has to be ignored and is just useful to count lines in the input file. For instance : def1 def2 def3 def4 Would be valid, but : def1 def2 def3 def 4 Wouldn't since each def must end with a \n What I want is a bit similar to what we have in Python in which we can write : def a(b): if b==0: return (b+1) or def a(b): if b==0: return (b+1) but def a(b): if b==0: return (b+1) is not allowed. The \n is necessary to indicate the end of a statement but has no effect on code otherwise. And I don't know to reproduce a such behaviour with ply. If I define a token like so : def t_NEWLINE(self,t): r'\n+' t.lexer.lineno += len(t.value) return t No \n would be allowed unless the grammar explicitly allows this token to be inserted almost everywhere. I thought about contextual grammar but there's a single context in my case. I just would like to be able to use \n both as a token is certain rules and ignored otherwise. Is there any way of doing this ?

Reputation: 2619

Ply Lex Yacc : treat \n as a token in some rules and ignore it otherwise

I'm trying to write a parser using ply in which the \n is sometimes syntactically important, and sometimes has to be ignored.

More precisely in the language I would like to parse there are some lines corresponding to definitions that must end with a \n to indicate the end of the definition. In all other cases, the \n has to be ignored and is just useful to count lines in the input file.

For instance :

def1 
def2


def3

def4

Would be valid, but :

def1 
def2 def3

def 4

Wouldn't since each def must end with a \n

What I want is a bit similar to what we have in Python in which we can write :

def a(b):
    if b==0:
        return (b+1)

def a(b):



    if b==0:

        return (b+1)

but

def a(b): if b==0: return (b+1)

is not allowed. The \n is necessary to indicate the end of a statement but has no effect on code otherwise.

And I don't know to reproduce a such behaviour with ply. If I define a token like so :

def t_NEWLINE(self,t):
    r'\n+'
    t.lexer.lineno += len(t.value)
    return t

No \n would be allowed unless the grammar explicitly allows this token to be inserted almost everywhere.

I thought about contextual grammar but there's a single context in my case. I just would like to be able to use \n both as a token is certain rules and ignored otherwise.

Is there any way of doing this ?

Upvotes: 2

Answers (1)

rici

Reputation: 241791

Since Ply gives you the power of a Turing-complete programming language (Python), there certainly will be a way. However, it's impossible to provide much of a solution without knowing anything about the specifics of the problem.

Lexical analysis of Python itself does require a more sophisticated strategy, which does include a small state machine (basically to eliminate newlines inside brackets, where they are ignored). Note that even simple Python statements must be terminated either with a newline or a semicolon, so the terminator is definitely in the grammar. Typical Python lexical analysers ignore comments and blank lines; I could provide an example, but I don't know that it is relevant here since your language is apparently only "a bit similar to what we have in Python".

So I've gone out on a limb here to try to think up a use case which fits the very broad description in your question and which is relatively easy to solve in Ply. I accept that it might have no relevance at all to your use case, but it might serve for some future reader with a different but similar requirement.

It's actually pretty rare to encounter a language in which statements do not require any form of termination, although it is certainly not impossible. For example, a typical language which includes

statements which end with an expression (or expression statements, like function calls),
statements which begin with an expression, again including expression statements but also keywordless assignment (a = b rather than, for example, let a = b),
dual-purpose parentheses representing both grouping and function call arguments.

will be ambiguous unless statements have a definite terminator. (a(b) could be a function call (one statement) or two consecutive expression statements; similar examples can be constructed for most languages which have the above characteristics.

Still, all that could be surmounted with language design. The easiest such design would be to require that all statements, even function calls and assignments, start with a keyword. Presumably in such a language, definition statements also start with a keyword and the only reason to insist on newlines around definitions is aesthetic. (But aesthetics is fine as a reason. It's aesthetics rather than parsing limitations which leads to the Python one-line definition in your question being illegal.)

Suppose then, we have a language with definition statements starting with the keyword whose symbol is DEF and ending with the symbol END (otherwise, we won't know where the definition ends). And we'll also assume assignment statements starting with the keyword LET, which require no explicit termination. (Of course there will be other statement types, but they will follow the same pattern as LET.) For whatever reason, we want to ensure that a DEF is always the first token on a line and an END is always the last token, which will guarantee that a definition does not horizontally coexist with any other statement, although we're comfortable with LET a = b LET c = 3.

One way to do this would be to ignore newlines except for the ones which precede DEF or follow END. We'd then write a grammar which included:

lines       : #empty
            | lines line NEWLINE
line        : #empty
            | line simple_stmt
            | definition
definition  : DEF prototype lines END
simple_stmt : LET lhs '=' rhs

Note that the above grammar requires that the program either be empty or end with a NEWLINE.

Now, to filter out the unimportant NEWLINE tokens, we can use a wrapper class around the Ply-generated lexer. The constructor for the wrapper takes a lexer as an argument, and filters the output stream from that lexer by removing NEWLINE tokens which are considered unimportant. We also ensure that the input ends with a NEWLINE unless it is empty, by fabricating a NEWLINE token if necessary. (That wasn't really part of the question, but it simplifies the grammar.)

# Used to fabricate a token object.
from types import SimpleNamespace

class LexerWrapper(object):

  def __init__(self, lexer):
    """Create a new wrapper given the lexer which is being wrapped"""
    self.lexer = lexer
    # None, or tuple containing queued token.
    # Using a tuple allows None (eof) to be queued.
    self.pending = None
    # Previous token or None
    self.previous = None

  def token(self):
    """Return the next token, or None if end of input has been reached"""
    # If there's a pending token, send it
    if self.pending is not None:
      t = self.pending[0]
      self.pending = None
      return t
    # Get the next (useful) token
    while True
      t = self.lexer.token()
      # Make sure that we send a NEWLINE before EOF
      if t is None:
        t, self.previous = self.previous, None
        self.pending = (None,)
        if t is not None and t.type != 'NEWLINE':
          # Manufacture a NEWLINE token if necessary
          t = SimpleNamespace( type='NEWLINE'
                             , value='\n'
                             , lineno=self.lexer.lineno
                             )
        return t
      elif t.type == 'NEWLINE':
        if self.previous is None or self.previous.type == 'NEWLINE':
          # Get another token
          continue
        if self.previous.type == 'END':
          # Use this NEWLINE if it follows END
          self.previous = None
        else:
          # Get another token
          self.previous = t
          continue
      else:
        self.previous = t
      return t

Upvotes: 2

Ply Lex Yacc : treat \n as a token in some rules and ignore it otherwise

Answers (1)

Related Questions