Parsing expressions - unary, binary and incrementing/decrementing operators

Question

I am trying to write a parser for my language (for learning and fun). The problem is that I don't know how to parse expressions like a--b or a----b. When there are multiple operators made from - character - unary (-x), binary (x-y), pre-decrement (--x, like in C) and post-decrement (x--). Both a--b and a----b should be valid and produce:

a--b -> sub     a----b -> sub
       +-+-+            +--+--+
       a  neg         decr   neg
           |           |      |
           b           a      b

When lexer tokenizes a--b it does not know if it is decrement or minus sign repeated two times, so the parser must find out which one is it.
How could I determine if - is part of decrement operator or just minus sign?

rici · Accepted Answer

The problem is not really parsing so much as deciding the rules.

Why should a----b be a-- - -b and not a - -(--b)? For that matter, should a---b be a-- - b or a - --b?

And what about a---3 or 3---a? Neither 3-- nor --3 make any sense, so if the criteria were "choose (one of) the sensible interpretations", you'd end up with a-- - 3 and 3 - --a. But even if that were implementable without excess effort, it would place a huge cognitive load on coders and code readers.

Once upon a time, submitting a program for execution was a laborious and sometimes bureaucratic process, and having a run cancelled because the compiler couldn't find the correct interpretation was enormously frustrating. . I still carry the memories of my student days, waiting in a queue to hand my programs to an computer operator and then in another queue to receive the printed results.

So it became momentarily popular to create programming languages which went to heroic lengths to find a valid interpretation of what they were given. But that effort also meant that some bugs passed without error, because the programmer and the programming language having different understandings of what the "natural interpretation" might be.

If you program in C/C++, you may well have at some time written a & 3 == 0 instead of (a & 3) == 0. Fortunately, modern compilers will warn you about this bug, if warnings are enabled. But it's at least reasonable to ask whether the construct should even be permitted. Although it's a little annoying to have to add parentheses and recompile, it's not nearly as frustrating as trying to debug the obscure behaviour which results. Or to have accepted the code in a code review without noticing the subtle error.

These days, the compile / test / edit cycle is much quicker, so there's little excuse fir insisting on clarity. If I were writing a compiler today, I'd probably flag as an error any potentially ambiguous sequence of operator characters. But that might be going too far.

In most languages, a relatively simple rule is used: at each point in the program, the lexical analysis chooses the longest possible token, whether or not it "makes sense". That's what C and C++ do (mostly) and it has the advantage of being easy to implement and also easy to verify. (Even so, in a code review I would insist that a---b be written as a-- -b.)

You could slightly modify this rule so that only the first pair of -- is taken as a token, which would capture some of your desired parses without placing too much load on the code reader.

You could use even more complicated rules, but remember that whatever you implement, you have to document. If it's too hard to document clearly, it's probably unsuitable.

Once you articulate your list of rules, you work on the implementation. In many cases, the simplest is to just try the possibilities in order, with backtracking or in parallel. Or you might be able to just precompute the possible parses.

Alternatively, you could use a GLR or GLL parser generator capable of finding all parses in an ambiguous grammar, and then select the "best" parse based on whatever criteria you prefer.

Parsing expressions - unary, binary and incrementing/decrementing operators

Answers (1)

Related Questions