Reputation: 11

JavaCC Ambiguities: How do I tell the parser to choose a certain match from the list of "longer matches"?

For some input, the parser presents a "Possible kinds of longer matches : { <EXPRESSION>, <TEXT> }", but for some odd reason it chooses the wrong one.

This is the source:

SKIP :
{
  " "  
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
  < DOT : "." >
| < LBRACE : "{" >
| < RBRACE : "}" >
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
| < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

void q0() :
{Token token = null;}
{
    (
        < LBRACE > expression() < RBRACE >
    |   ( token = < TEXT >
            {
              getTextTokens().add( token.image );
            }
        )
    )* < EOF >
}


void expression() :
{Token token = null;}
{
  < EXPRESSION >
}

If we parse "a.bc.d" with this grammar, it reports " FOUND A <EXPRESSION> MATCH (a.bc.d) ".

My question is why did it choose to parse the input as an <EXPRESSION> instead of <TEXT>?

Also, how can I force the parser to choose the right path? I have tried countless LOOKAHEAD scenarios with no success.

For instance, the right path is <TEXT> for the input "a.bc.d", and <EXPRESSION> for "{a.bc.d}".

Thanks in advance.

Upvotes: 1

Answers (2)

Theodore Norvell

Reputation: 16221

If expressions appear only within { braces }, only expressions (and white space) appear within braces, and braces are used only to delimit expressions, then you can do something like the following. See question 3.11 in the JavaCC FAQ if you are not familiar with lexical states.

// The following abbreviations hold in any state.
TOKEN : {
  < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
}

// Skip white space in either state
<DEFAULT,INBRACES> SKIP : { " "  | "\r" | "\t" | "\n" }

// The following are recognized in the default state.
// A left brace forces a switch to the INBRACES state.
<DEFAULT> TOKEN : {
  < DOT : "." >
| < LBRACE : "{" > : INBRACES
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

// A right brace forces a switch to the DEFAULT state.
<DEFAULT, INBRACES> TOKEN : {
  < RBRACE : "}"  > : DEFAULT
}

// Expressions are only recognized in the INBRACES state.
<INBRACES> TOKEN : {
  < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
}

It looks a bit dodgy that DOT is defined in one state and used in another. However, I think that it works fine.
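
To make the effect of the two lexical states easier to see, here is a rough hand-written Java sketch of the same idea. It is not what JavaCC generates. The class name LexicalStateDemo, the tokenize method, and the regex translations are my own assumptions; the brackets and DOT tokens of the DEFAULT state are left out, and the patterns use + instead of the outer * so that they never match the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of lexical states; not JavaCC-generated code.
final class LexicalStateDemo {

    enum State { DEFAULT, INBRACES }

    static final String ID = "[a-z]+";

    // Approximate translations of TEXT and EXPRESSION (non-empty variants).
    static final Pattern TEXT = Pattern.compile("(?:\\.*[a-z]+\\.*)+");
    static final Pattern EXPRESSION = Pattern.compile(
        ID + "\\." + ID + "\\." + ID + "(?:\\." + ID + "|\\[[0-9]*\\])*");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            // White space is skipped in both states.
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                pos++;
                continue;
            }
            // A left brace is only a token in DEFAULT and switches to INBRACES.
            if (c == '{' && state == State.DEFAULT) {
                tokens.add("LBRACE({)");
                state = State.INBRACES;
                pos++;
                continue;
            }
            // A right brace is a token in either state and switches to DEFAULT.
            if (c == '}') {
                tokens.add("RBRACE(})");
                state = State.DEFAULT;
                pos++;
                continue;
            }
            // Only one of TEXT / EXPRESSION is eligible, depending on the state.
            boolean inBraces = (state == State.INBRACES);
            Pattern pattern = inBraces ? EXPRESSION : TEXT;
            String kind = inBraces ? "EXPRESSION" : "TEXT";
            Matcher m = pattern.matcher(input).region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                    "Lexical error at position " + pos + " in state " + state);
            }
            tokens.add(kind + "(" + m.group() + ")");
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a.bc.d"));   // [TEXT(a.bc.d)]
        System.out.println(tokenize("{a.bc.d}")); // [LBRACE({), EXPRESSION(a.bc.d), RBRACE(})]
    }
}
```

Because TEXT is never a candidate inside braces and EXPRESSION is never a candidate outside them, the two definitions cannot compete for the same input.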

Upvotes: 1

Gunther

Reputation: 5256

From the JavaCC FAQ:

If more than one regular expression describes the longest possible prefix, then the regular expression that comes first in the .jj file is used.

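So a preference can be established by ordering ambiguous definitions accordingly. In the question's grammar, both <EXPRESSION> and <TEXT> match all of "a.bc.d", and <EXPRESSION> is defined first, so it wins.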
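
The rule can be tried out with a small Java simulation. This is only a sketch under my own assumptions: TieBreakDemo, pick, expressionFirst and textFirst are hypothetical names, the patterns are approximate java.util.regex translations of the two token definitions, and java.util.regex is a backtracking engine rather than a true longest-match one, so it only stands in for the token manager on simple inputs like these.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative simulation only; JavaCC's token manager is implemented differently.
final class TieBreakDemo {

    static final String ID = "[a-z]+";

    // Approximate translations of the question's EXPRESSION and TEXT definitions.
    static final Pattern EXPRESSION = Pattern.compile(
        "(?:" + ID + "\\." + ID + "\\." + ID + "(?:\\." + ID + "|\\[[0-9]*\\])*)*");
    static final Pattern TEXT = Pattern.compile("(?:\\.*[a-z]+\\.*)*");

    // Token kinds in the order of the original grammar.
    static Map<String, Pattern> expressionFirst() {
        Map<String, Pattern> kinds = new LinkedHashMap<>();
        kinds.put("EXPRESSION", EXPRESSION);
        kinds.put("TEXT", TEXT);
        return kinds;
    }

    // The same token kinds with the two definitions swapped.
    static Map<String, Pattern> textFirst() {
        Map<String, Pattern> kinds = new LinkedHashMap<>();
        kinds.put("TEXT", TEXT);
        kinds.put("EXPRESSION", EXPRESSION);
        return kinds;
    }

    // Returns the kind with the longest non-empty prefix match.
    // A later kind replaces an earlier one only if it is strictly longer,
    // so the earliest definition wins ties.
    static String pick(Map<String, Pattern> kindsInFileOrder, String input) {
        String bestKind = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> entry : kindsInFileOrder.entrySet()) {
            Matcher m = entry.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                bestKind = entry.getKey();
                bestLength = m.end();
            }
        }
        return bestKind;
    }

    public static void main(String[] args) {
        System.out.println(pick(expressionFirst(), "a.bc.d")); // EXPRESSION
        System.out.println(pick(textFirst(), "a.bc.d"));       // TEXT
        System.out.println(pick(expressionFirst(), "a.b"));    // TEXT
    }
}
```

Note that reordering on its own also makes "a.bc.d" inside braces come out as <TEXT>, so it does not by itself give <EXPRESSION> for "{a.bc.d}".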

Upvotes: 2
