Reputation: 11
For some input, the parser presents a "Possible kinds of longer matches : { <EXPRESSION>, <TEXT> }", but for some odd reason it chooses the wrong one.
This is the source:
SKIP :
{
" "
| "\r"
| "\t"
| "\n"
}
TOKEN :
{
< DOT : "." >
| < LBRACE : "{" >
| < RBRACE : "}" >
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
| < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}
void q0() :
{ Token token = null; }
{
  (
    < LBRACE > expression() < RBRACE >
  | ( token = < TEXT >
      {
        getTextTokens().add( token.image );
      }
    )
  )* < EOF >
}
void expression() :
{ Token token = null; }
{
  < EXPRESSION >
}
If we try to parse "a.bc.d" using this grammar, the debug output reports " FOUND A <EXPRESSION> MATCH (a.bc.d) ".
My question is why did it choose to parse the input as an <EXPRESSION> instead of <TEXT>?
Also, how can I force the parser to choose the right path? I have tried countless LOOKAHEAD scenarios with no success.
The right path is for instance <TEXT> when using "a.bc.d" as input, and <EXPRESSION> for "{a.bc.d}".
Thanks in advance.
Upvotes: 1
Views: 802
Reputation: 16221
If expressions appear only within { braces }, only expressions (and white space) appear within braces, and braces are used only to delimit expressions, then you can do something like the following. See question 3.11 in the JavaCC FAQ if you are not familiar with lexical states.
// The following abbreviations hold in any state.
TOKEN : {
< #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
}
// Skip white space in either state
<DEFAULT,INBRACES> SKIP : { " " | "\r" | "\t" | "\n" }
// The following are recognized in the default state.
// A left brace forces a switch to the INBRACES state.
<DEFAULT> TOKEN : {
< DOT : "." >
| < LBRACE : "{" > : INBRACES
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}
// A right brace forces a switch to the DEFAULT state.
<DEFAULT, INBRACES> TOKEN : {
< RBRACE : "}" > : DEFAULT
}
// Expressions are only recognized in the INBRACES state.
<INBRACES> TOKEN : {
< EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
}
It looks a bit dodgy that DOT is defined in one state and used in another. However, I think that it works fine.
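To make the state switching concrete, here is a rough plain-Java simulation of the idea, not JavaCC itself. The class name LexicalStateDemo, the Rule helper, and the translation of the token definitions into java.util.regex patterns are all assumptions made for illustration. The simulation also ignores zero-length matches, which the real token manager treats differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a tiny lexer with two lexical states, mimicking the grammar above.
class LexicalStateDemo {
    enum State { DEFAULT, INBRACES }

    // One token definition: its kind, its regex, and the state to switch to (null = stay).
    static final class Rule {
        final String kind;
        final Pattern pattern;
        final State next;
        Rule(String kind, String regex, State next) {
            this.kind = kind;
            this.pattern = Pattern.compile(regex);
            this.next = next;
        }
    }

    // Assumed regex translations of the EXPRESSION and TEXT tokens.
    static final String EXPRESSION =
        "(?:[a-z]+\\.[a-z]+\\.[a-z]+(?:\\.[a-z]+|\\[[0-9]*\\])*)*";
    static final String TEXT = "(?:\\.*[a-z]+\\.*)*";

    // Rules per state, listed in the same order as in the grammar file.
    static List<Rule> rulesFor(State state) {
        Rule rbrace = new Rule("RBRACE", "\\}", State.DEFAULT);
        if (state == State.DEFAULT) {
            return Arrays.asList(
                new Rule("DOT", "\\.", null),
                new Rule("LBRACE", "\\{", State.INBRACES),
                new Rule("LBRACKET", "\\[", null),
                new Rule("RBRACKET", "\\]", null),
                new Rule("TEXT", TEXT, null),
                rbrace);
        }
        return Arrays.asList(rbrace, new Rule("EXPRESSION", EXPRESSION, null));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            // SKIP rule, active in both states.
            if (" \r\t\n".indexOf(input.charAt(pos)) >= 0) { pos++; continue; }
            Rule best = null;
            int bestEnd = pos; // only non-empty matches can beat this
            for (Rule r : rulesFor(state)) {
                Matcher m = r.pattern.matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) { best = r; bestEnd = m.end(); }
            }
            if (best == null) {
                throw new IllegalArgumentException(
                    "No token at offset " + pos + " in state " + state);
            }
            out.add(best.kind + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
            if (best.next != null) state = best.next; // lexical state switch
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [TEXT(a.bc.d), LBRACE({), EXPRESSION(a.bc.d), RBRACE(})]
        System.out.println(tokenize("a.bc.d {a.bc.d}"));
    }
}
```

Because TEXT is only a candidate in DEFAULT and EXPRESSION only in INBRACES, the two never compete for the same input, so no LOOKAHEAD is needed to separate them.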
Upvotes: 1
Reputation: 5256
From the JavaCC FAQ:
If more than one regular expression describes the longest possible prefix, then the regular expression that comes first in the .jj file is used.
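So a preference can be established by ordering ambiguous definitions accordingly. In the question's grammar, EXPRESSION is declared before TEXT, and both match all of "a.bc.d", so the tie goes to EXPRESSION.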
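The rule quoted from the FAQ can be illustrated with a small plain-Java sketch. This is not how JavaCC is implemented. The class LongestMatchDemo, its helpers, and the java.util.regex translations of the two tokens are assumptions for illustration. Java's regex engine is not a longest-match engine in general, but for these inputs its greedy match is also the longest prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "longest match wins, ties go to the first definition".
class LongestMatchDemo {
    // Assumed regex translations of the question's EXPRESSION and TEXT tokens.
    static final String EXPRESSION =
        "(?:[a-z]+\\.[a-z]+\\.[a-z]+(?:\\.[a-z]+|\\[[0-9]*\\])*)*";
    static final String TEXT = "(?:\\.*[a-z]+\\.*)*";

    // Builds an insertion-ordered map from (name, regex) pairs; order = declaration order.
    static Map<String, String> kinds(String... pairs) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            m.put(pairs[i], pairs[i + 1]);
        }
        return m;
    }

    // Returns the kind whose regex matches the longest prefix of the input.
    static String pick(Map<String, String> kinds, String input) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : kinds.entrySet()) {
            Matcher m = Pattern.compile(e.getValue()).matcher(input);
            // Strict '>' means an equally long later definition never replaces an earlier one.
            if (m.lookingAt() && m.end() > bestLen) {
                bestLen = m.end();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> expressionFirst = kinds("EXPRESSION", EXPRESSION, "TEXT", TEXT);
        Map<String, String> textFirst = kinds("TEXT", TEXT, "EXPRESSION", EXPRESSION);

        System.out.println(pick(expressionFirst, "a.bc.d")); // prints EXPRESSION
        System.out.println(pick(textFirst, "a.bc.d"));       // prints TEXT
        System.out.println(pick(expressionFirst, "abc"));    // prints TEXT (longer match)
    }
}
```

Note that ordering alone sets one global preference: with TEXT first, "a.bc.d" inside braces would also come out as TEXT, so ordering by itself does not give both behaviours the question asks for.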
Upvotes: 2