Jamesy82
Jamesy82

Reputation: 379

How to write an Antlr4 grammar that matches X number of characters

I want to use Antlr4 to parse a format that stores the length of segments in the serialised form

For example, to parse: "6,Hello 5,World"

I tried to create a grammar like this

grammar myGrammar;

sequence:
 (LEN ',' TEXT)*;

LEN: [0-9]+;
TEXT: // I dont know what to put in here but it should match LEN number of chars

Is this even possible with Antlr?

A real world example of this would be parsing the messagePack binary format which has several types that serialise the length of the data into the serialised form.

For example there is the str8:

str 8 stores a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+========+
|  0xd9  |YYYYYYYY|  data  |
+--------+--------+========+

And str16 type

str16 stores a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+========+
|  0xda  |ZZZZZZZZ|ZZZZZZZZ|  data  |
+--------+--------+--------+========+

In these examples the first byte identifies the type, then we have 1 byte for str8 and 2 bytes for str16 which contain the length of the data. Then finally there is the data.

I think a rule might look something like this but dont know how to match the right amount of data

str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;

BYTE : '\u0000'..'\u00FF' ;
DATA : ???

Upvotes: 0

Views: 513

Answers (1)

rici
rici

Reputation: 241861

The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised with a regular expression (or even with a context-free grammar) so it's not usually supported by standard tokenisers.

Fortunately, it's easy to tokenise. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.

Once you have writen the datastream tokeniser, you could use a parser generator like Antlr to build a datastructure from the parse, but it's rarely nevessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straight-forward (although for both of those examples, standard tools exist).

In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars do not have any way of handling a language such as "at most one of A, B, C, D, and E in any order" other than enumerating the alternatives, of which there are an exponential number.

Upvotes: 1

Related Questions