Reputation: 655479
I want to tokenize some binary data where the length of some parts depends on the value of a previous token. You can think of it as follows:
<length><binary data>
Let’s say length is a two-byte unsigned integer that denotes the length of the binary data in bytes.
How can I implement this correlation with ANTLR 4?
Upvotes: 4
Views: 571
Reputation: 170227
You may need to extend ANTLR's input streams. At the moment, the only input streams, ANTLRInputStream and ANTLRFileStream, are backed by a char[], which may not suit your requirement to match any kind of binary data.
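One possible workaround, offered here as an assumption rather than something taken from the ANTLR documentation: decode the raw bytes as ISO-8859-1, which maps every byte value 0 to 255 to the char with the same numeric value, so a char[]-backed stream can carry the data without loss. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesAsChars {
    // Map each byte (0..255) to the char with the same numeric value.
    public static String toCharData(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Reverse mapping: recover the original bytes from the char data.
    public static byte[] toBytes(String charData) {
        return charData.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x02, (byte) 0xCA, (byte) 0xFE};
        String chars = toCharData(raw);
        System.out.println((int) chars.charAt(2));               // prints 202 (0xCA)
        System.out.println(Arrays.equals(raw, toBytes(chars)));  // prints true
    }
}
```

The resulting string could then be handed to an ANTLRInputStream, with the lexer treating each char as one byte.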
To make the lexer context-sensitive as you described, you could:
1. match an UNSIGNED number token and, once it matches, initialize an instance variable (bytesToConsume) with its value;
2. once bytesToConsume has been set, consume bytes/chars as long as bytesToConsume is larger than 0;
3. as long as bytesToConsume is set, not match an UNSIGNED token.
These checks are performed by semantic predicates: {boolean-expression}?.
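Outside of ANTLR, the same counter-driven idea can be sketched as a small hand-written scanner. This is only an illustration under stated assumptions: the length is written as two hex digits and the payload is plain characters, and the class name LengthPrefixedScanner and method scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedScanner {
    // Splits input of the form <2 hex digits><payload>... into alternating
    // length and payload strings, driven by a counter like bytesToConsume.
    public static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Step 1: read the length token and initialize the counter.
            if (pos + 2 > input.length()) {
                throw new IllegalArgumentException("truncated length at " + pos);
            }
            String length = input.substring(pos, pos + 2);
            int bytesToConsume = Integer.parseInt(length, 16);
            pos += 2;
            tokens.add(length);

            // Step 2: consume chars while the counter is larger than 0.
            // Step 3 is implicit: no length token is attempted inside this loop.
            StringBuilder payload = new StringBuilder();
            while (bytesToConsume > 0) {
                if (pos >= input.length()) {
                    throw new IllegalArgumentException("truncated payload at " + pos);
                }
                payload.append(input.charAt(pos++));
                bytesToConsume--;
            }
            tokens.add(payload.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("03aaa01c")); // prints [03, aaa, 01, c]
    }
}
```

In a grammar, the loop condition and the "no length token here" rule become predicates on the corresponding lexer rules.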
A demo:
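grammar T;

@lexer::members {
  // Countdown for the payload that follows a length token;
  // negative means no payload is pending.
  private int bytesToConsume = -1;

  // Used as a semantic predicate. Note the side effect: every call
  // that returns true also decrements the counter.
  boolean binary() {
    if(bytesToConsume < 0) {
      return false;
    }
    bytesToConsume--;
    return true;
  }
}

parse
 : block* EOF
 ;

block
 : UNSIGNED BINARY
 ;

UNSIGNED
 : {!binary()}?
   [0-9a-fA-F] [0-9a-fA-F] {bytesToConsume = Integer.parseInt(getText(), 16);}
 ;

BINARY
 : ({binary()}? . )+
 ;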
A driver class:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class Main {
  public static void main(String[] args) throws Exception {
    TLexer lexer = new TLexer(new ANTLRInputStream("03aaa0Fbbbbbbbbbbbbbbb01c"));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    ParseTree tree = parser.parse();
    System.out.println(tree.toStringTree(parser));
  }
}
Test it by doing the following on Linux/macOS:
java -jar antlr-4.0-complete.jar T.g4
javac -cp .:antlr-4.0-complete.jar *.java
java -cp .:antlr-4.0-complete.jar Main
or on Windows:
java -jar antlr-4.0-complete.jar T.g4
javac -cp .;antlr-4.0-complete.jar *.java
java -cp .;antlr-4.0-complete.jar Main
And you'll see the following being printed to the console (I added indentation though):
(parse
  (block 03 aaa)
  (block 0F bbbbbbbbbbbbbbb)
  (block 01 c)
  <EOF>)
Perhaps something cleaner is possible with ANTLR 4's lexical modes. However, I'm quite new to v4 and I don't know whether that would work here: you need to pop back to the default mode once a certain number of bytes/chars have been consumed, rather than at a clear end marker inside a BINARY mode.
Upvotes: 1