Rex Redi
Rex Redi

Reputation: 93

ANTLR - Token Enumeration Mismatch Between Grammar and Tree Grammar

BackGround

I am trying to write a simple grammar, using AntlrWorks, for boolean equations that test sets of values for the existence (or lack there of) of specified elements. I have created a combined lexer/parser grammar that produces the desired ASTs. I have also written a cooresponding tree grammar, that seems to work (passes AntlrWorks' debug functions).


Problem

However, When I try to wire the two together in a test program (that is lex, parse, and tree parse in the same program), I get errors like...

node from line 1:5 required (...)+ loop did not match anything at input 'and'

and

node from after line 1:8 mismatched tree node: UP expecting <DOWN>

As a sanity test, I had the test program output the results of toStringTree() from the generated AST and toTokenTypeString() from the resulting TreeNodeStream.

What I found was that the enumerated token type values of the TreeNodeStream do not match the enumerated token type values in the autogenerated Tree Grammar code.


EXAMPLE

That token stream should be AND <DOWN> 'true' 'false' <UP> NEWLINE But the TreeParser sees it as CLOSEPAREN <DOWN> OR 'false' <UP> OPENPAREN (based off of looking at the node token type output and checking it against the enumeration defined in the tree grammar) and throws the error

1:5 required (...)+ loop did not match anything at input 'and'


Bottom Line

Why isn't my tree parser set up to properly identify my ASTs?

Below is my source. I appreciate any feedback on the foolish mistakes I must have made :)

Lexer/Parser Grammar

grammar INTc;

options {
   output=AST;
   ASTLabelType=CommonTree;
}

tokens {
   OR='or';
   AND='and';
   NOT='not';
   ALLIN='+';
   PARTIN='^';
   NOTIN='!';
   SET;
   OPENPAREN='(';
   CLOSEPAREN=')';
   OPENSET='{';
   CLOSESET='}';
}
@header {
package INTc;
}

@lexer::header {
package INTc;
}

@members {
}

/**Begin Parser Rules*/
prog:   stat+ ;

stat:   expr
    |   NEWLINE
    ;

expr
:  orExpr
;

orExpr returns [boolean value]
    :   a=andExpr(OR^ b=andExpr)*
    ;

andExpr returns [boolean value]
    :   a=notExpr (AND^ b=notExpr)*
    ; 

notExpr returns [boolean value]
    :   a=atom
    | '!' a=atom -> ^(NOT atom)
    ;

atom returns [boolean value]
    :   ALLIN  OPENSET ((INT)(','INT)*) CLOSESET   -> ^(ALLIN ^(SET INT+))
    |   PARTIN  OPENSET ((INT)(','INT)*) CLOSESET  -> ^(PARTIN ^(SET INT+))
    |   NOTIN OPENSET ((INT)(','INT)*) CLOSESET   -> ^(NOTIN  ^(SET INT+))
    |   TIMERANGE
    |   OPENPAREN! e=expr CLOSEPAREN!
    |   'true'
    |   'false'
    ;

/**Begin Lexer Rules*/
ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
DIGIT   :   ('0'..'9');
INT :   DIGIT+ ;
NEWLINE :   '\r'? '\n' ;
WS  :   ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

Tree Grammar

tree grammar INTcWalker;

options {
  tokenVocab=INTc;
  ASTLabelType=CommonTree;
}

@header {
  package INTc;
  import java.util.ArrayList;
  import java.util.Arrays;
}

@members {
  ArrayList<String> intSet;
  boolean isFit = false;

  public boolean getResult() {
     return isFit;   
  }
  public void setINTSet(ArrayList newSet) {
     intSet = newSet;
     isFit = false;
  }
  public ArrayList getINTSET(){return intSet;}
}

prog
:     stat+
;
stat
:     expr  {
                                     isFit = $expr.value;
                                     //System.out.println(isFit);
    }
|    NEWLINE {}
;
expr returns [boolean value]
: ^(OR a=expr b=expr){}
| ^(AND a=expr b=expr){}
| ^(NOT a=expr){}
| ^(ALLIN ^(SET INT+)){}
| ^(PARTIN ^(SET INT+)){}
| ^(NOTIN ^(SET INT+)){}
| 'true'        {$value = true;}
| 'false'       {$value = false;}
;

Test Program

public class setTest {

    public static void main(String args[]) throws Exception {
        INTcLexer lex = new INTcLexer(new ANTLRFileStream("input.txt"));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        INTcParser parser = new INTcParser(tokens);
        INTcParser.prog_return r = parser.prog();
        CommonTree t  = (CommonTree)r.getTree();
        CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
        INTcWalker evaluator = new INTcWalker(nodes);

        System.out.println(t.toStringTree());

        System.out.println(nodes.toTokenTypeString());
        nodes.reset();

        try {
                evaluator.prog();
        } catch (RecognitionException e) {
                e.printStackTrace();
        }   

        System.out.println(evaluator.getResult());

    }   
}

Upvotes: 4

Views: 2397

Answers (1)

Bart Kiers
Bart Kiers

Reputation: 170178

If I use your combined grammar and tree grammar to create lexer, parser and tree-walker classes, and run the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
  public static void main(String args[]) throws Exception {
    INTcLexer lex = new INTcLexer(new ANTLRStringStream("true and false\n"));
    CommonTokenStream tokens = new CommonTokenStream(lex);
    INTcParser parser = new INTcParser(tokens);

    CommonTree t  = (CommonTree)parser.prog().getTree();
    CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
    INTcWalker evaluator = new INTcWalker(nodes);

    System.out.println(t.toStringTree());

    CommonTree tr;
    while(true) {
      Token token = ((CommonTree)nodes.nextElement()).getToken();
      if(token.getType() == INTcParser.EOF) break;
      System.out.printf("%-10s '%s'\n", INTcParser.tokenNames[token.getType()], token.getText());
    }

    System.out.println("\nresult=" + evaluator.getResult());
  }
}

the following is printed to the console:

(and true false) 

AND        'and'
<DOWN>     'DOWN'
'true'     'true'
'false'    'false'
<UP>       'UP'
NEWLINE    '
'

result=false

I.e.: I see the expected output:

  • the tree is okay ((and true false));
  • CommonTreeNodeStream contains the proper tokens (or better: trees);
  • and the correct value, false, is being printed without any errors from either the parser or tree walker.

A couple of tips:

  • create tokens for both 'true' and 'false' (i.e. TRUE='true'; ...);
  • don't use literals inside your tree grammar (not 'true', but TRUE);
  • make DIGIT a fragment rule, that way it will never become a token of its own, but only used inside INT (or other lexer rules). Simply place the keyword fragment in front of it;
  • both .* and .+ are ungreedy by default, so you can remove the options greedy=false;} :.

Upvotes: 3

Related Questions