Reputation: 379
I want to use Antlr4 to parse a format that stores the length of segments in the serialised form
For example, to parse: "6,Hello 5,World"
I tried to create a grammar like this
grammar myGrammar;
sequence:
(LEN ',' TEXT)*;
LEN: [0-9]+;
TEXT: // I dont know what to put in here but it should match LEN number of chars
Is this even possible with Antlr?
A real world example of this would be parsing the messagePack binary format which has several types that serialise the length of the data into the serialised form.
For example there is the str8:
str 8 stores a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+========+
| 0xd9 |YYYYYYYY| data |
+--------+--------+========+
And str16 type
str16 stores a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+========+
| 0xda |ZZZZZZZZ|ZZZZZZZZ| data |
+--------+--------+--------+========+
In these examples the first byte identifies the type, then we have 1 byte for str8 and 2 bytes for str16 which contain the length of the data. Then finally there is the data.
I think a rule might look something like this but dont know how to match the right amount of data
str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;
BYTE : '\u0000'..'\u00FF' ;
DATA : ???
Upvotes: 0
Views: 513
Reputation: 241861
The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised with a regular expression (or even with a context-free grammar) so it's not usually supported by standard tokenisers.
Fortunately, it's easy to tokenise. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.
Once you have writen the datastream tokeniser, you could use a parser generator like Antlr to build a datastructure from the parse, but it's rarely nevessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straight-forward (although for both of those examples, standard tools exist).
In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars do not have any way of handling a language such as "at most one of A, B, C, D, and E in any order" other than enumerating the alternatives, of which there are an exponential number.
Upvotes: 1