neolei

Reputation: 1948

ANTLR parser, can I stop at first match?

I’m using ANTLR to write a parser for the PDF object structure, but I've run into a problem parsing an input stream that mixes PDF References and Integers.

Basically, a PDF Reference is a string like this: "10 0 R" (INTEGER SPACE INTEGER SPACE ‘R’).

Here's my grammar file (simplified):

grammar Pdf;

options {
language=CSharp3;
backtrack=true;
}

public r returns [string val]
    :   ref {$val = $r.text;}
    |   INTEGER {$val = $r.text;}
    ;

ref
    :   INTEGER SPACE INTEGER SPACE 'R';

INTEGER
    :   DIGIT+;

SPACE: ' ';

fragment DIGIT
    :   '0'..'9'
    ;

Here's the test code (in C#):

byte[] bytes = Encoding.ASCII.GetBytes("97 98 10 0 R 100 101");
MemoryStream stream = new MemoryStream(bytes);

ANTLRInputStream inputStream = new ANTLRInputStream(stream);
PdfLexer lexer = new PdfLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);

PdfParser parser = new PdfParser(tokens);
string result = parser.r();

I expect result to be the text of the first alternative matched in rule r (either ref or INTEGER).

For example:

There is no need to go through the whole input. Just match the first alternative, then stop.

I'm new to ANTLR and couldn't figure out how to do this. I'm using ANTLRWorks 1.4.3 and antlr-dotnet-csharpruntime-3.4.1.9004.

Any help is appreciated!

Upvotes: 0

Views: 1233

Answers (1)

Bart Kiers

Reputation: 170148

backtrack=true applies only to parser rules, not lexer rules. So when the lexer encounters INTEGER SPACE followed by something other than INTEGER, it will throw an error/exception: it will not backtrack out of the REF rule and create INTEGER and SPACE tokens instead.

REF shouldn't be a lexer rule to begin with, though; it should be a parser rule:

ref
 : INTEGER SPACE INTEGER SPACE 'R'
 ;

Edit

I'm on Linux and therefore cannot test the C# target (at least, I've never been able to get the CSharp3 target running inside MonoDevelop). But here's a Java demo:

grammar Pdf;

public r
 : ( ref     {System.out.println("ref     = '" + $ref.text + "'");}
   | INTEGER {System.out.println("INTEGER = '" + $INTEGER.text + "'");}
   | SPACE   {System.out.println("SPACE   = '" + $SPACE.text + "'");}
   )*
   EOF
 ;

ref
 : INTEGER SPACE INTEGER SPACE 'R'
 ;

INTEGER
 : DIGIT+;

SPACE
 : ' '
 ;

fragment DIGIT
 : '0'..'9'
 ;

You can test the parser with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PdfLexer lexer = new PdfLexer(new ANTLRStringStream("97 98 10 0 R 100 101"));
    PdfParser parser = new PdfParser(new CommonTokenStream(lexer));
    parser.r();
  }
}

and if you run this class, the following is printed:

INTEGER = '97'
SPACE   = ' '
INTEGER = '98'
SPACE   = ' '
ref     = '10 0 R'
SPACE   = ' '
INTEGER = '100'
SPACE   = ' '
INTEGER = '101'

which is exactly as I expected.
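The demo above consumes the whole input. If you only want the first match, as the question asks, the logic amounts to "try ref at the start, otherwise fall back to a single INTEGER, then stop". Here is a minimal hand-written Java sketch of that idea. It does not use ANTLR at all, and the class and method names (FirstMatch, firstMatch, scanRef, etc.) are made up for illustration; it only mirrors the INTEGER, SPACE and 'R' rules from the grammar.

```java
class FirstMatch {

    // INTEGER: one or more digits '0'..'9' starting at pos.
    // Returns the index just past the digits, or -1 if there is no match.
    static int scanInteger(String s, int pos) {
        if (pos < 0) return -1;               // an earlier step already failed
        int i = pos;
        while (i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9') {
            i++;
        }
        return i > pos ? i : -1;
    }

    // A single literal character (used for SPACE and 'R').
    static int scanChar(String s, int pos, char c) {
        if (pos < 0 || pos >= s.length() || s.charAt(pos) != c) return -1;
        return pos + 1;
    }

    // ref: INTEGER SPACE INTEGER SPACE 'R'
    static int scanRef(String s, int pos) {
        int i = scanInteger(s, pos);
        i = scanChar(s, i, ' ');
        i = scanInteger(s, i);
        i = scanChar(s, i, ' ');
        return scanChar(s, i, 'R');
    }

    // Tries ref first, then INTEGER, and stops after the first match.
    // Returns the matched text, or null if neither alternative matches.
    static String firstMatch(String s) {
        int end = scanRef(s, 0);
        if (end < 0) {
            end = scanInteger(s, 0);          // fall back to a lone INTEGER
        }
        return end < 0 ? null : s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("97 98 10 0 R 100 101")); // 97
        System.out.println(firstMatch("10 0 R 100 101"));       // 10 0 R
    }
}
```

The ANTLR equivalent would be a rule r that lists only the ref and INTEGER alternatives, without the surrounding ( ... )* and without the trailing EOF, so that the parser returns after the first alternative that matches.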

Upvotes: 1
