marchemike
marchemike

Reputation: 3277

ANTLR basic example in Java

I have been searching the net for the past few hours, trying to learn a simple example of using ANTLR.But I am having a hard time understanding the examples. Does any body have simple example that would output this in Java:

if my input is printf("Hello World");

the output should be :

Hello World

and if my input is

inx =1;

it should give an error message.

I'm trying to create a c++ compiler(starting with the lexical up until the semantic part only) using java,and I would really like to know what I should do.

Upvotes: 0

Views: 6000

Answers (2)

Austin Henley
Austin Henley

Reputation: 4633

From ANTLR here is the trivial example of parsing (and evaluating) an expression.

grammar Expr;

@header {
package test;
import java.util.HashMap;
}

@lexer::header {package test;}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

prog:   stat+ ;

stat:   expr NEWLINE {System.out.println($expr.value);}
    |   ID '=' expr NEWLINE
        {memory.put($ID.text, new Integer($expr.value));}
    |   NEWLINE
    ;

expr returns [int value]
    :   e=multExpr {$value = $e.value;}
        (   '+' e=multExpr {$value += $e.value;}
        |   '-' e=multExpr {$value -= $e.value;}
        )*
    ;

multExpr returns [int value]
    :   e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
    ; 

atom returns [int value]
    :   INT {$value = Integer.parseInt($INT.text);}
    |   ID
        {
        Integer v = (Integer)memory.get($ID.text);
        if ( v!=null ) $value = v.intValue();
        else System.err.println("undefined variable "+$ID.text);
        }
    |   '(' e=expr ')' {$value = $e.value;}
    ;

    ID  :   ('a'..'z'|'A'..'Z')+ ;
    INT :   '0'..'9'+ ;
    NEWLINE:'\r'? '\n' ;
    WS  :   (' '|'\t')+ {skip();} ;

But like I mentioned in my comments, C++ is very hard to parse correctly. There are many ambiguities and requires * amount of look ahead (which ANTLR does provide). So doing this in any efficient form is complicated. That is why I recommend implementing something like PL/0 which was designed for students to write their first compiler for. Tiny BASIC is also a good start. Both of these can be implemented without using a tool like ANTLR by doing recursive descent. I have implemented both in under 1000 lines together (in C++ and C# respectively).

ANTLR is a great tool though, especially once you get your head wrapped around recursive descent you might want to upgrade to a more powerful parser. I recommend both of Terrence Parr's books, ANTLR Reference and Language Implementation Patterns. The ANTLR book will tell you everything (plus some) that you want to know about ANTLR. The second book will teach you all about parsers and compilers, from recursive descent to black-magic backtracking.

More resources from a similar SO question can be found here. And if you're into Lisp or Scheme, you can check out JScheme, it is written in Java (less than 1000 lines I believe).

Upvotes: 2

ccoakley
ccoakley

Reputation: 3255

Here is a grammar that almost does what you want:

grammar PrintLang;

sentence 
    :    statement
    ;

statement 
    :   functionCall '(' argument ')' ';'
    { 
      if ($functionCall.funName.equals("printf")) {
        System.out.println($argument.arg);
      }
    }
    ;

functionCall returns [String funName]
    :    ID 
    { $funName = $ID.text; }
    ;

argument returns [String arg]
    :   STRING
    { $arg = $STRING.text; }
    ;

ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

I generated this in AntlrWorks. All of the token rules were generated for me.

Here is the java file to test it.

import org.antlr.runtime.*;


public class PrintIt {
  public static void main(String args[]) {
    String inputString = "printf(\"HelloWorld\");";

    // Create an input character stream from standard in
    ANTLRStringStream input = new ANTLRStringStream(inputString); 
    // Create an ExprLexer that feeds from that stream 
    PrintLangLexer lexer = new PrintLangLexer(input);
    // Create a stream of tokens fed by the lexer 
    CommonTokenStream tokens = new CommonTokenStream(lexer); 
    // Create a parser that feeds off the token stream 
    PrintLangParser plParser = new PrintLangParser(tokens);
    try {
        plParser.sentence();
    } catch (Exception e) {
        e.printStackTrace();
    }
  }
}

You'll note that this java code is almost a verbatim copy/paste from the Antlr website example (I don't believe I even changed the comments, which is why the comment refers to Standard in, but the code actually uses a String). And here is the command line I used to do it.

bash$ java -cp ./antlr-3.4-complete.jar org.antlr.Tool PrintLang.g
bash$ javac -cp ./:./antlr-3.4-complete.jar PrintIt.java 
bash$ java -cp antlr-3.4-complete.jar:. PrintIt
"HelloWorld"

Oops, I forgot that the string I wanted to print isn't the matched token ("HelloWorld", including the quotes), it's the string within the quotes.

Also, you'll note that I hardcoded the lookup of printf as a string comparison. In reality, you'll want an environment that contains the symbols accessible at a given scope (related, see antlr's "scope" construct. More difficult, though sometimes useful: create an environment that you pass to each parsing rule).

Most important: find Bart Kiers answers by searching SO for more antlr questions. He posts excellent examples.

Upvotes: 5

Related Questions