Gumbo

Reputation: 655479

How to make the length of one token depend on the value of another token?

I want to tokenize some binary data where the length of some parts depends on the value of a previous token. You can think of it as follows:

<length><binary data>

Let’s say length is a two-byte unsigned integer that denotes the length of binary data in bytes.

How can I implement this correlation with ANTLR 4?

Upvotes: 4

Views: 571

Answers (1)

Bart Kiers

Reputation: 170227

You may need to extend ANTLR's input streams. At the moment, the only input streams, ANTLRInputStream and ANTLRFileStream, are backed by a char[], which may not suit your requirement to match arbitrary binary data.
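One possible workaround that avoids writing a custom stream, sketched here under the assumption that representing each input byte as one char is acceptable: decode the bytes as ISO-8859-1, which maps every byte value 0..255 to the char with the same code. The class and method names (BinaryChars, toLexerInput, toBytes) are made up for illustration and are not part of ANTLR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR: maps each byte to exactly one char.
final class BinaryChars {

  // ISO-8859-1 decodes every byte value to the char with the same code,
  // so no byte is dropped or merged (unlike UTF-8 decoding).
  static String toLexerInput(byte[] data) {
    return new String(data, StandardCharsets.ISO_8859_1);
  }

  // Reverse mapping, e.g. for the text of a matched token.
  static byte[] toBytes(String tokenText) {
    return tokenText.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    byte[] data = {0x00, 0x03, (byte) 0xFF, (byte) 0x80, 0x41};
    String input = toLexerInput(data);
    // The string could now be passed to new ANTLRInputStream(input).
    System.out.println(input.length());        // prints 5
    System.out.println((int) input.charAt(2)); // prints 255
  }
}
```

With this mapping the lexer sees one char per byte, so a rule such as `.` matches exactly one byte of the original data.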

To make the lexer context sensitive as you described, you could:

  • match an UNSIGNED number token and, once it matches, initialize an instance variable (bytesToConsume) with its value;
  • once bytesToConsume has been set, consume bytes/chars for as long as bytesToConsume is larger than 0 (*);
  • of course, as soon as bytesToConsume has been initialized, you don't want to match an UNSIGNED token (*).

(*) These checks are performed by semantic predicates: {boolean-expression}?.
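The counting idea behind these steps can be sketched in plain Java, without ANTLR, to show what the lexer is meant to do. This is only an illustration under the same assumption as the demo below (the length is written as two hex digits); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration; none of these names are ANTLR API.
final class LengthPrefixedSplitter {

  // Splits input of the form <2 hex digits><that many chars>... into tokens.
  static List<String> split(String input) {
    List<String> tokens = new ArrayList<>();
    int pos = 0;
    while (pos < input.length()) {
      if (pos + 2 > input.length()) {
        throw new IllegalArgumentException("truncated length at offset " + pos);
      }
      // Step 1: read the length token and remember its value.
      String length = input.substring(pos, pos + 2);
      int bytesToConsume = Integer.parseInt(length, 16);
      pos += 2;
      if (pos + bytesToConsume > input.length()) {
        throw new IllegalArgumentException("truncated data at offset " + pos);
      }
      // Step 2: consume exactly that many chars as one data token.
      tokens.add(length);
      tokens.add(input.substring(pos, pos + bytesToConsume));
      pos += bytesToConsume;
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(split("03aaa01c02dd")); // prints [03, aaa, 01, c, 02, dd]
  }
}
```

Because the data part is consumed by count rather than by pattern, it may itself contain hex digits without being mistaken for a length.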

A demo:

grammar T;

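@lexer::members {

  // Counter set from the UNSIGNED token; negative means no length is pending.
  private int bytesToConsume = -1;

  // Used by the semantic predicates below. Returns true while a length is pending
  // and, as a side effect, decrements the counter on every call.
  boolean binary() {
    if(bytesToConsume < 0) {
      return false;
    }
    bytesToConsume--;
    return true;
  }
}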

parse
 : block* EOF
 ;

block
 : UNSIGNED BINARY
 ;

// Two hex digits that give the length of the BINARY token that follows.
UNSIGNED
 : {!binary()}?
   [0-9a-fA-F] [0-9a-fA-F] {bytesToConsume = Integer.parseInt(getText(), 16);}
 ;

BINARY
 : ({binary()}? . )+
 ;

A driver class:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class Main {

  public static void main(String[] args) throws Exception {
    TLexer lexer = new TLexer(new ANTLRInputStream("03aaa0Fbbbbbbbbbbbbbbb01c"));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    ParseTree tree = parser.parse();
    System.out.println(tree.toStringTree(parser));
  }
}

Test it by doing:

*nix

java -jar antlr-4.0-complete.jar T.g4
javac -cp .:antlr-4.0-complete.jar *.java
java -cp .:antlr-4.0-complete.jar Main

Windows

java -jar antlr-4.0-complete.jar T.g4
javac -cp .;antlr-4.0-complete.jar *.java
java -cp .;antlr-4.0-complete.jar Main

And you'll see the following being printed to the console (I added indentation though):

(parse 
  (block 03 aaa) 
  (block 0F bbbbbbbbbbbbbbb) 
  (block 01 c) 
  <EOF>)

EDIT

Perhaps something cleaner is possible by making use of ANTLR 4's lexical modes. However, I'm quite new to v4 and don't know whether this is possible, since you would need to pop back to the default mode after a certain number of bytes/chars has been consumed, rather than at a clear end marker in a BINARY mode.

Upvotes: 1
