mahonya

Reputation: 10075

Antlr syntactic predicate behaving inconsistently: why?

The following grammar works correctly when given this input:

cd/someotherpath/someotherpath/path

This input should be parsed as an identifier (cd) and an object path (someotherpath/someotherpath/path), separated by a '/'.
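
To make the intended result concrete, here is a minimal Java sketch of the split I expect. It is not part of the grammar and does not use Antlr; the class and method names (ExpectedSplitSketch, splitIdentifiedPath) are made up for illustration, and it assumes the split always happens at the first '/'.

```java
class ExpectedSplitSketch {
    // Hypothetical helper: splits at the FIRST '/' only.
    // Returns {identifier, objectPath}; objectPath is empty when there is no '/'.
    static String[] splitIdentifiedPath(String input) {
        int slash = input.indexOf('/');
        if (slash < 0) {
            return new String[] { input, "" };
        }
        return new String[] { input.substring(0, slash), input.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitIdentifiedPath("cd/someotherpath/someotherpath/path");
        System.out.println("identifier:  " + parts[0]);
        System.out.println("object path: " + parts[1]);
    }
}
```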

It took me hours to find a working grammar rule for this. The grammar contains a commented-out identified_path rule that did not work. The problem with that rule is that the standalone identifier at the beginning is accepted even though the characters after it cannot be parsed, and this is what confuses me. When the commented-out rule is used, Antlr begins parsing, sees cd, recognizes it as an identifier, then sees '/', cannot match it, and leaves the rule without trying the other alternatives. Even if I force a syntactic predicate on the identifier-only alternative, such as (identifier)=> identifier, Antlr accepts this incomplete match and does not look at the other alternatives.

After breaking the rule down as shown in the grammar, it works as expected, but I have no idea why the first version (commented out below) does not work. It is simply the inlined version of the working one. Here is the grammar:

grammar RecursionTests;


@members{
    public boolean isValidAlphanumericAtIdentifier(String pVal){
            String upper = pVal.toUpperCase();
            if(upper.startsWith("A")){
                //the second character must exist and be a letter
                //(a one-character value is rejected here instead of throwing on substring)
                if(pVal.length() < 2)
                    return false;
                char secondChar = pVal.charAt(1);
                //explicit ranges: a plain 'A'..'z' check would also accept [ \ ] ^ _ `
                boolean isLetter = (secondChar >= 'A' && secondChar <= 'Z')
                                || (secondChar >= 'a' && secondChar <= 'z');
                if(!isLetter)
                    return false;

                //the second char is a letter; if it is t, the third char must not be t
                if(upper.startsWith("ATT"))
                    return false; //att is not allowed

                //passed all tests, it is valid
                return true;
            }
            return false;
        }

        public boolean isValidNonAtIdentifier(String pVal){
            if(pVal.length() > 1){
                return !(pVal.substring(1,2).toUpperCase().equals("T"));//second char should not be T
            }
            else
                return !(pVal.toUpperCase().equals("A"));//ok if it is not A


    }
}

rul :   identified_path
                         ;




//Identifier = {LetterMinusA}{IdCharMinusT}?{IdChar}* | 'a''t'?(({letter}|'_')*|{LetterMinusT}{Alphanumeric}*)

/*
identified_path 
    :   identifier
    |   (identifier forward_slash object_path)=> identifier forward_slash object_path
    |   identifier predicate        
    |   (identifier predicate forward_slash object_path)=>identifier predicate forward_slash object_path                                            
        ;
*/


identified_path
    :   identifier_or_id_based_path
    |   identifier_or_id_predicate_path
    ;

identifier_or_id_based_path
    :   identifier
    |   (identifier forward_slash object_path)=>(identifier forward_slash object_path)      
    ;        
identifier_or_id_predicate_path 
    :   identifier predicate    
    |   (identifier predicate forward_slash object_path)=>identifier predicate forward_slash object_path                                            
    ;

object_path  : path_part (forward_slash path_part)*
        ;

forward_slash
    : {input.LT(1).getText().equals("/")}? Uri_String_Chars
    ;

path_part : identifier (predicate)?
    ;

predicate : node_predicate
        ;       

node_predicate : square_bracket_open node_predicate_expr square_bracket_close
//node_predicate : square_bracket_open identifier square_bracket_close
        ;

square_bracket_open
    : {input.LT(1).getText().equals("[")}? Non_Uri_String_RegEX_Chars
    ;

square_bracket_close
    : {input.LT(1).getText().equals("]")}? Non_Uri_String_RegEX_Chars
    ;

node_predicate_expr
    :   (node_predicate_comparable ((And | Or) node_predicate_comparable)*)=>node_predicate_comparable ((And | Or) node_predicate_comparable)*
    ;       

node_predicate_comparable : (predicate_operand comparable_operator predicate_operand)=> predicate_operand comparable_operator predicate_operand
                          | Node_id
                          | (Node_id char_comma string_r)=> Node_id char_comma string_r        // node_id_r and name/value = <String> shortcut
                          | (Node_id char_comma parameter)=> Node_id char_comma parameter     // node_id_r and name/value = <Parameter> shortcut
                          | (node_predicate_reg_ex)=> node_predicate_reg_ex    // /items[{/at0001.* /}], /items[at0001 and name/value matches {//}
                          | (archetype_id)=>archetype_id
                          | (archetype_id char_comma string_r)=> archetype_id char_comma string_r        // node_id_r and name/value = <String> shortcut
                          | (archetype_id char_comma parameter)=> archetype_id char_comma parameter      // node_id_r and name/value = <Parameter> shortcut
        ;

predicate_operand : //identifier
            //| identifier PathItem
                     object_path
                     | operand 
        ;

operand : string_r | Integer_r | date_r | parameter | Boolean_r
        ;

string_r
    : (Quotation_Mark string_char* Quotation_Mark) 
    | Quote Quote string_char* Quote Quote
    ;

parameter
    :   char_dollar_sign Letter id_char*
    ;

archetype_id
    :   Letter char_hypen  Letter char_hypen archetype_id_letter_underscore_literal Dot (id_char|char_hypen) Dot alphanumeric 
    ;

archetype_id_letter_underscore_literal
    :   Letter
    |   Letter_or_underscore
    ;



comparable_operator
    :   char_equals | op_not_equals | char_greater | op_greater_or_eq | char_smaller | op_smaller_or_eq //Uri_String_Chars
    ;

char_equals
    :   {input.LT(1).getText().equals("=")}? Uri_String_Chars
    ;
op_not_equals
    :   {input.LT(1).getText().equals("!") && input.LT(2).getText().equals("=") }? (Uri_String_Chars Uri_String_Chars)
    ;
char_greater
    :   {input.LT(1).getText().equals(">")}? Special_Chars
    ;
op_greater_or_eq
    :   {input.LT(1).getText().equals(">") && input.LT(2).getText().equals("=") }? (Special_Chars Uri_String_Chars)
    ;

char_smaller
    :   {input.LT(1).getText().equals("<")}? Special_Chars
    ;
op_smaller_or_eq
    :   {input.LT(1).getText().equals("<") && input.LT(2).getText().equals("=") }? (Special_Chars Uri_String_Chars)
    ;


date_r  
    :   Quote Quote Single_Digit Single_Digit Single_Digit Single_Digit char_hypen Single_Digit Single_Digit char_hypen Single_Digit Single_Digit Single_Digit Single_Digit 
    ;


node_predicate_reg_ex    : reg_ex_pattern
                          | predicate_operand Op_matches reg_ex_pattern                          
        ;


reg_ex_pattern
    :   start_reg_ex_pattern reg_ex_char+ end_reg_ex_pattern
    ;

start_reg_ex_pattern
    :   {   input.LT(1).getText().equals("{") && 
            input.LT(2).getText().equals("/")
        }? (Non_Uri_String_RegEX_Chars Non_Uri_String_RegEX_Chars)
    ;

end_reg_ex_pattern
    :   {   input.LT(1).getText().equals("/") && 
            input.LT(2).getText().equals("}")
        }? (Non_Uri_String_RegEX_Chars Non_Uri_String_RegEX_Chars)
    ;

reg_ex_char
    :   alphanumeric | Uri_String_Chars | Non_Uri_String_RegEX_Chars
    ;


letter_minus_a
    :   {input.LT(1).getText().contains("a") == false && input.LT(1).getText().contains("A") == false}? Single_letter
    ;


letter_minus_t
    :   {input.LT(1).getText().contains("t") == false && input.LT(1).getText().contains("T") == false}? Single_letter
    ;   


id_char_minus_t
    :   {input.LT(1).getText().contains("t") == false && input.LT(1).getText().contains("T") == false}? Single_Id_Char
    ;


id_char
    :   Id_char
    |   Letter_or_underscore //may hit this since it is more specific than Id_char
    ;


alphanumeric //alternatives to alphanumeric will show up since they are more specific than alphanumeric, but may fit
    :   Alphanumeric
    |   Single_letter 
    |   Letter 
    ;

string_char
    :   String_char
    ;


char_low_case_a
    :   {input.LT(1).getText().equals("a")}? Single_letter
    ;

char_low_case_t
    :   {input.LT(1).getText().equals("t")}? Single_letter
    ;

char_comma
    :   {input.LT(1).getText().equals(",")}? Special_Chars
    ;   

char_dollar_sign
    :   {input.LT(1).getText().equals("$")}? Uri_String_Chars
    ;
char_hypen
    :   {input.LT(1).getText().equals("-")}? Uri_String_Chars
    ;

letter_or_underscore
    : Letter_or_underscore
    ;



//Identifier = {LetterMinusA}{IdCharMinusT}?{IdChar}* | 'a''t'?(({letter}|'_')*|{LetterMinusT}{Alphanumeric}*)  
identifier 
    :   {!(input.LT(1).getText().toUpperCase().startsWith("A")) }?=>non_at_identifier
    |   {input.LT(1).getText().toUpperCase().startsWith("A")}?=>at_identifier
    ;

non_at_identifier
    :   {isValidNonAtIdentifier(input.LT(1).getText())}?non_at_identifier_literal 
    ;

at_identifier
    :   at_identifier_literal
    ;

at_identifier_literal
    :   Single_letter //if it is only one letter, it must be a|A
    |   Letter  //if more than one letter, again it must start with a|A
    |   Letter_or_underscore
    |   {isValidAlphanumericAtIdentifier(input.LT(1).getText())}?Alphanumeric //if second char it t, third must be a non T LETTER
    ;   


non_at_identifier_literal
    :   Id_char     
    |   Alphanumeric
    |   Letter 
    |   Letter_or_underscore
    |   Single_letter
    ;


Node_id 
    :   At_code ( Digit+ (Dot Digit+)*)
    ;

At_code :   'at'
    ;

And :   'and'
    ;

Or  :   'or'
    ;

Dot :   '.'
    ;

Op_matches
    : 'matches'
    ;        

Boolean_r
    :   'true'| 'false'
    ;

Quote   :   '\''
    ;

Single_Digit
    :   Digit
    ;   

Integer_r
    :   Digit+
    ;

Float_r 
    :   Digit+ '.' Digit+
    ;

Single_letter
    :   Letter_lowercase | Letter_uppercase
    ;


Letter  :   (Letter_lowercase | Letter_uppercase)+
    ;





Alphanumeric
    :   (Letter_lowercase | Letter_uppercase | Digit)+  
    ;


Special_Chars   
    :   (Special_Char_list)+
    ;   

String_char
    : (Special_Char_list | Letter_lowercase | Letter_uppercase | Digit)+    
    ;   





Single_Id_Char
    :   Letter_lowercase | Letter_uppercase | Underscore | Digit 
    ;


Letter_or_underscore
    :   (Letter | Underscore)+ 
    ;

Id_char
    :   (Letter| Digit | Underscore)+
    ;

//Identifier = {LetterMinusA}{IdCharMinusT}?{IdChar}* | 'a''t'?(({letter}|'_')*|{LetterMinusT}{Alphanumeric}*)  




Uri_String_Chars
    :   '_' | '-' | '/' | ':' | '.' | '?' | '&' | '%' | '$' | '#' | '@' | '!' | '+' | '=' | '*'
    ;   

Non_Uri_String_RegEX_Chars//used for regex, alongside Uri_String_Chars
    :   '|' |'(' | ')' |'\\' | '^' | '{' |  '}' | '[' | ']'
    ;

Quotation_Mark
    :   '"'
    ;


fragment Special_Char_list
    :   //' '|  
        ','          
        | ';' | '<' | '>' 
        | '`'
        | '~'
    ;
/*
AND     :   'and'
    ;
OR  :   'or'
    ;
AT  :   'at'        
    ;
MATCHES :   'matches'
    ;
*/
WS  :   ( ' '
        | '\t'
        | '\r'
                | '\n'
        ) {$channel=HIDDEN;}
    ;

fragment Letter_uppercase
    :   'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
    ;

fragment Letter_lowercase
    :   'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
    ;


fragment Underscore
    :   '_'
    ;

fragment Digit  
    :   '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    ;           

Upvotes: 1

Views: 239

Answers (1)

Bart Kiers

Reputation: 170278

I have not tested this, but try moving the alternatives with syntactic predicates to the top of the rule:

identified_path 
 : (identifier predicate forward_slash object_path)=>
    identifier predicate forward_slash object_path
 | (identifier forward_slash object_path)=>
    identifier forward_slash object_path
 | identifier predicate
 | identifier
 ;

or:

identified_path 
 : (identifier predicate forward_slash object_path)=> 
    identifier predicate forward_slash object_path
 | (identifier forward_slash object_path)=>
    identifier forward_slash object_path
 | identifier predicate?
 ;

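The parser goes through the alternatives from top to bottom. That is why the alternatives you force lookahead upon (the ones with a syntactic predicate in front of them) are usually best placed at the top, ahead of shorter alternatives such as a bare identifier that would otherwise match first.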
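
As a rough illustration of why the order matters, here is a toy Java model of "first listed alternative that matches wins". It is only a sketch and not the code ANTLR generates: the names OrderedChoiceSketch, Alt, choose, shortFirst and longFirst are hypothetical, the token kinds ID, SLASH and PATH are made up, and object_path is collapsed into a single PATH token to keep it short.

```java
import java.util.Arrays;
import java.util.List;

class OrderedChoiceSketch {
    // Hypothetical alternative: a label plus the token kinds it must match as a prefix.
    static final class Alt {
        final String label;
        final List<String> kinds;

        Alt(String label, String... kinds) {
            this.label = label;
            this.kinds = Arrays.asList(kinds);
        }

        boolean matchesPrefixOf(List<String> tokens) {
            return tokens.size() >= kinds.size()
                && tokens.subList(0, kinds.size()).equals(kinds);
        }
    }

    // Returns the label of the first alternative, in listed order, that matches.
    static String choose(List<Alt> alts, List<String> tokens) {
        for (Alt alt : alts) {
            if (alt.matchesPrefixOf(tokens)) {
                return alt.label;
            }
        }
        return "no viable alternative";
    }

    // Order as in the commented-out rule: the bare identifier comes first.
    static List<Alt> shortFirst() {
        return Arrays.asList(
            new Alt("identifier", "ID"),
            new Alt("identifier forward_slash object_path", "ID", "SLASH", "PATH"));
    }

    // Order as suggested above: the longer alternative comes first.
    static List<Alt> longFirst() {
        return Arrays.asList(
            new Alt("identifier forward_slash object_path", "ID", "SLASH", "PATH"),
            new Alt("identifier", "ID"));
    }

    public static void main(String[] args) {
        // Simplified token kinds for an input like cd/path
        List<String> tokens = Arrays.asList("ID", "SLASH", "PATH");
        System.out.println(choose(shortFirst(), tokens)); // prints "identifier"
        System.out.println(choose(longFirst(), tokens));  // prints "identifier forward_slash object_path"
    }
}
```

With the short alternative listed first, the model stops at the bare identifier and leaves the '/' and the rest unconsumed, which mirrors the behaviour described in the question.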

Upvotes: 1
