How do I parse an array in ANTLR?

Question

I'm working on parsing PDF content streams. I'm having trouble defining an array. The definition of an array in the PDF reference (PDF 32000-1:2008) is:

An array object is a one-dimensional collection of objects arranged sequentially. …an array’s elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. An array may have zero elements.

An array shall be written as a sequence of objects enclosed in SQUARE BRACKETS (using LEFT SQUARE BRACKET (5Bh) and RIGHT SQUARE BRACKET (5Dh)).

EXAMPLE: [549 3.14 false (Ralph) /SomeName]

Here's a stripped-down version of my grammar:

grammar PdfStream;

/*
 * Parser Rules
 */

content : stat* ;

stat
    : array
    | string
    ;

array: ARRAY ;
string: STRING ;

/*
 * Lexer Rules
 */

ARRAY: '[' (ARRAY | DICTIONARY | OBJECT)* ']' ;

DICTIONARY: '<<' (NAME (ARRAY | DICTIONARY | OBJECT))*  '>>' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

STRING: (LITERAL_STRING | HEX_STRING) ;

NAME: '/' ID ;

INT: DIGIT+ ;

LITERAL_STRING: '(' .*? ')' ;

HEX_STRING: '<' [0-9A-Za-z]+ '>' ;

FLOAT:  DIGIT+ '.' DIGIT*
     |         '.' DIGIT+
     ;

OBJECT
    : NULL
    | BOOLEAN
    | NUMBER
    | STRING
    | NAME
    ;

fragment DIGIT:   [0-9] ;

// All characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

And here's the test file I'm processing.


(String1)
( String2 )
[]
[549 3.14 false (Ralph) /SomeName]

When I process the file with grun PdfStream tokens -tokens stream.txt I get this output:

line 5:0 token recognition error at: '[549 '
line 5:33 token recognition error at: ']'
[@0,0:5='',,1:0]
[@1,7:15='(String1)',,2:0]
[@2,17:27='( String2 )',,3:0]
[@3,29:30='[]',,4:0]
[@4,37:40='3.14',,5:5]
[@5,42:46='false',,5:10]
[@6,48:54='(Ralph)',,5:16]
[@7,56:64='/SomeName',,5:24]
[@8,67:66='',,6:0]

What's wrong with my grammar that's causing the token recognition errors?

sepp2k · Accepted Answer

[549 3.14 false (Ralph) /SomeName] isn't recognized as an ARRAY because it contains spaces, and the ARRAY rule does not allow any whitespace between its elements. The skip action on WS only discards whitespace between tokens; it has no effect inside another lexer rule. If you want whitespace between the elements of an array to be ignored, turn ARRAY into a parser rule instead of a lexer rule (the same applies to DICTIONARY).

You'll also need to make OBJECT a parser rule. As a lexer rule it can never match: when several lexer rules match the same longest input, ANTLR picks the one defined first, so any input that matches, say, NUMBER always produces a NUMBER token rather than an OBJECT token, because OBJECT is defined after NUMBER in the grammar. In general, you never want multiple lexer rules where everything that one of them matches can also be matched by at least one other. For the same reason, INT and FLOAT should become fragments.
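Putting those changes together, a revised grammar might look roughly like the sketch below. It is only an illustration of the advice above, not a complete PDF grammar. The token definitions are carried over from the question. Making LITERAL_STRING and HEX_STRING fragments goes one step beyond what is stated above and is an assumption that follows the same reasoning, since STRING already covers them. The lowercase rule names array, dictionary and object are chosen here for illustration.

```antlr
grammar PdfStream;

/*
 * Parser Rules
 */

content : stat* ;

stat
    : array
    | string
    ;

// Parser rules work on tokens, so whitespace skipped by WS
// may appear freely between the elements.
array      : '[' object* ']' ;
dictionary : '<<' (NAME object)* '>>' ;

// Formerly the OBJECT lexer rule, which could never produce a token.
object
    : NULL
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    | array
    | dictionary
    ;

string : STRING ;

/*
 * Lexer Rules
 */

NULL    : 'null' ;
BOOLEAN : 'true' | 'false' ;
NUMBER  : ('+' | '-')? (INT | FLOAT) ;
STRING  : LITERAL_STRING | HEX_STRING ;
NAME    : '/' ID ;

// Fragments only help build other tokens; they never become tokens themselves.
fragment INT            : DIGIT+ ;
fragment FLOAT          : DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
fragment LITERAL_STRING : '(' .*? ')' ;             // assumption: fragment, see note above
fragment HEX_STRING     : '<' [0-9A-Za-z]+ '>' ;    // assumption: fragment, see note above
fragment DIGIT          : [0-9] ;

// All characters except whitespace and defined delimiters ()<>[]{}/%
ID : ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS : [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
```

With this layout the brackets become tokens of their own, and [549 3.14 false (Ralph) /SomeName] should be lexed as '[', two NUMBER tokens, a BOOLEAN, a STRING, a NAME and ']'. To inspect the resulting parse tree rather than only the token stream, grun can be given the start rule together with the -tree option, for example grun PdfStream content -tree stream.txt.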
