Reputation: 2619
I'm trying to write a parser using ply in which the \n
is sometimes syntactically important, and sometimes has to be ignored.
More precisely in the language I would like to parse there are some lines corresponding to definitions that must end with a \n
to indicate the end of the definition. In all other cases, the \n
has to be ignored and is just useful to count lines in the input file.
For instance :
def1
def2
def3
def4
Would be valid, but :
def1
def2 def3
def 4
Wouldn't since each def must end with a \n
What I want is a bit similar to what we have in Python in which we can write :
def a(b):
if b==0:
return (b+1)
or
def a(b):
if b==0:
return (b+1)
but
def a(b): if b==0: return (b+1)
is not allowed.
The \n
is necessary to indicate the end of a statement but has no effect on code otherwise.
And I don't know to reproduce a such behaviour with ply. If I define a token like so :
def t_NEWLINE(self,t):
r'\n+'
t.lexer.lineno += len(t.value)
return t
No \n
would be allowed unless the grammar explicitly allows this token to be inserted almost everywhere.
I thought about contextual grammar but there's a single context in my case. I just would like to be able to use \n
both as a token is certain rules and ignored otherwise.
Is there any way of doing this ?
Upvotes: 2
Views: 1688
Reputation: 241791
Since Ply gives you the power of a Turing-complete programming language (Python), there certainly will be a way. However, it's impossible to provide much of a solution without knowing anything about the specifics of the problem.
Lexical analysis of Python itself does require a more sophisticated strategy, which does include a small state machine (basically to eliminate newlines inside brackets, where they are ignored). Note that even simple Python statements must be terminated either with a newline or a semicolon, so the terminator is definitely in the grammar. Typical Python lexical analysers ignore comments and blank lines; I could provide an example, but I don't know that it is relevant here since your language is apparently only "a bit similar to what we have in Python".
So I've gone out on a limb here to try to think up a use case which fits the very broad description in your question and which is relatively easy to solve in Ply. I accept that it might have no relevance at all to your use case, but it might serve for some future reader with a different but similar requirement.
It's actually pretty rare to encounter a language in which statements do not require any form of termination, although it is certainly not impossible. For example, a typical language which includes
a = b
rather than, for example, let a = b
),will be ambiguous unless statements have a definite terminator. (a(b)
could be a function call (one statement) or two consecutive expression statements; similar examples can be constructed for most languages which have the above characteristics.
Still, all that could be surmounted with language design. The easiest such design would be to require that all statements, even function calls and assignments, start with a keyword. Presumably in such a language, definition statements also start with a keyword and the only reason to insist on newlines around definitions is aesthetic. (But aesthetics is fine as a reason. It's aesthetics rather than parsing limitations which leads to the Python one-line definition in your question being illegal.)
Suppose then, we have a language with definition statements starting with the keyword whose symbol is DEF
and ending with the symbol END
(otherwise, we won't know where the definition ends). And we'll also assume assignment statements starting with the keyword LET
, which require no explicit termination. (Of course there will be other statement types, but they will follow the same pattern as LET
.) For whatever reason, we want to ensure that a DEF
is always the first token on a line and an END
is always the last token, which will guarantee that a definition does not horizontally coexist with any other statement, although we're comfortable with LET a = b LET c = 3
.
One way to do this would be to ignore newlines except for the ones which precede DEF
or follow END
. We'd then write a grammar which included:
lines : #empty
| lines line NEWLINE
line : #empty
| line simple_stmt
| definition
definition : DEF prototype lines END
simple_stmt : LET lhs '=' rhs
Note that the above grammar requires that the program either be empty or end with a NEWLINE.
Now, to filter out the unimportant NEWLINE
tokens, we can use a wrapper class around the Ply-generated lexer. The constructor for the wrapper takes a lexer as an argument, and filters the output stream from that lexer by removing NEWLINE tokens which are considered unimportant. We also ensure that the input ends with a NEWLINE unless it is empty, by fabricating a NEWLINE token if necessary. (That wasn't really part of the question, but it simplifies the grammar.)
# Used to fabricate a token object.
from types import SimpleNamespace
class LexerWrapper(object):
def __init__(self, lexer):
"""Create a new wrapper given the lexer which is being wrapped"""
self.lexer = lexer
# None, or tuple containing queued token.
# Using a tuple allows None (eof) to be queued.
self.pending = None
# Previous token or None
self.previous = None
def token(self):
"""Return the next token, or None if end of input has been reached"""
# If there's a pending token, send it
if self.pending is not None:
t = self.pending[0]
self.pending = None
return t
# Get the next (useful) token
while True
t = self.lexer.token()
# Make sure that we send a NEWLINE before EOF
if t is None:
t, self.previous = self.previous, None
self.pending = (None,)
if t is not None and t.type != 'NEWLINE':
# Manufacture a NEWLINE token if necessary
t = SimpleNamespace( type='NEWLINE'
, value='\n'
, lineno=self.lexer.lineno
)
return t
elif t.type == 'NEWLINE':
if self.previous is None or self.previous.type == 'NEWLINE':
# Get another token
continue
if self.previous.type == 'END':
# Use this NEWLINE if it follows END
self.previous = None
else:
# Get another token
self.previous = t
continue
else:
self.previous = t
return t
Upvotes: 2