Reputation: 10728
I'm working on parsing PDF content streams. I'm having trouble defining an array. The definition of an array in the PDF reference (PDF 32000-1:2008) is:
An array object is a one-dimensional collection of objects arranged sequentially. …an array’s elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. An array may have zero elements.
An array shall be written as a sequence of objects enclosed in SQUARE BRACKETS (using LEFT SQUARE BRACKET (5Bh) and RIGHT SQUARE BRACKET (5Dh)).
EXAMPLE: [549 3.14 false (Ralph) /SomeName]
Here's a stripped-down version of my grammar:
grammar PdfStream;
/*
* Parser Rules
*/
content : stat* ;
stat
: array
| string
;
array: ARRAY ;
string: STRING ;
/*
* Lexer Rules
*/
ARRAY: '[' (ARRAY | DICTIONARY | OBJECT)* ']' ;
DICTIONARY: '<<' (NAME (ARRAY | DICTIONARY | OBJECT))* '>>' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
STRING: (LITERAL_STRING | HEX_STRING) ;
NAME: '/' ID ;
INT: DIGIT+ ;
LITERAL_STRING: '(' .*? ')' ;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
FLOAT: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
OBJECT
: NULL
| BOOLEAN
| NUMBER
| STRING
| NAME
;
fragment DIGIT: [0-9] ;
// All characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
And here's the test file I'm processing.
<AE93>
(String1)
( String2 )
[]
[549 3.14 false (Ralph) /SomeName]
When I process the file with grun PdfStream tokens -tokens stream.txt, I get this output:
line 5:0 token recognition error at: '[549 '
line 5:33 token recognition error at: ']'
[@0,0:5='<AE93>',<STRING>,1:0]
[@1,7:15='(String1)',<STRING>,2:0]
[@2,17:27='( String2 )',<STRING>,3:0]
[@3,29:30='[]',<ARRAY>,4:0]
[@4,37:40='3.14',<NUMBER>,5:5]
[@5,42:46='false',<BOOLEAN>,5:10]
[@6,48:54='(Ralph)',<STRING>,5:16]
[@7,56:64='/SomeName',<NAME>,5:24]
[@8,67:66='<EOF>',<EOF>,6:0]
What's wrong with my grammar that's causing the token recognition errors?
Reputation: 370112
[549 3.14 false (Ralph) /SomeName] isn't recognized as an ARRAY because it contains spaces, and the ARRAY rule does not allow any. The WS rule's skip only discards whitespace that appears between tokens, never whitespace inside a single token. If you want whitespace to be ignored between the elements of an array, turn ARRAY into a parser rule instead of a lexer rule (the same applies to DICTIONARY).
You'll also need to make OBJECT a parser rule. As a lexer rule it will never be matched: any input that matches, say, NUMBER will always produce a NUMBER token rather than an OBJECT token, because when two lexer rules match the same text the one defined first wins, and OBJECT comes last in the grammar. In general, you never want a lexer rule whose every possible match can also be matched by another lexer rule. For the same reason, INT and FLOAT should be turned into fragments.
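As a rough sketch of what those changes could look like, here is the grammar from the question with ARRAY, DICTIONARY and OBJECT moved to parser rules and INT and FLOAT marked as fragments. Marking LITERAL_STRING and HEX_STRING as fragments too is my own extension of the same reasoning, not something the question requires, and I have not run this against a real content stream, so treat it as a starting point rather than a finished grammar.

```antlr
grammar PdfStream;

/*
 * Parser Rules
 */
content    : stat* ;
stat       : array
           | string
           ;
// Parser rules see the token stream after WS has been skipped,
// so whitespace between array elements is no longer a problem.
array      : '[' object* ']' ;
dictionary : '<<' (NAME object)* '>>' ;
object     : NULL
           | BOOLEAN
           | NUMBER
           | string
           | NAME
           | array
           | dictionary
           ;
string     : STRING ;

/*
 * Lexer Rules
 */
NULL    : 'null' ;
BOOLEAN : 'true' | 'false' ;
NUMBER  : ('+' | '-')? (INT | FLOAT) ;
STRING  : LITERAL_STRING | HEX_STRING ;
NAME    : '/' ID ;

// Fragments never produce tokens of their own; they are only
// building blocks for the lexer rules above.
fragment INT   : DIGIT+ ;
fragment FLOAT : DIGIT+ '.' DIGIT*
               | '.' DIGIT+
               ;
// Assumption: also fragments, since STRING already covers both.
fragment LITERAL_STRING : '(' .*? ')' ;
fragment HEX_STRING     : '<' [0-9A-Za-z]+ '>' ;
fragment DIGIT          : [0-9] ;

// All characters except whitespace and defined delimiters ()<>[]{}/%
ID : ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS : [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
```

With this version, the last line of the test file should be lexed as separate '[', NUMBER, NUMBER, BOOLEAN, STRING, NAME and ']' tokens, and the array parser rule groups them. Running grun with the content start rule and -tree instead of the tokens pseudo-rule should show the resulting parse tree.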