Can I make my ANTLR4 Lexer discard a character from the input stream?

Question

I'm working on parsing PDF streams. Section 7.3.4.2 of the PDF Reference, on literal string objects, says that a backslash within a literal string should be ignored unless it is followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\". Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
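
To make the rule concrete, here is a minimal Python sketch of the decoding behaviour described above, applied to the text between the outer parentheses of a literal string. It is an illustration only and not part of any ANTLR API: the function name `unescape_literal` and the helper tables are hypothetical, and reducing octal values modulo 256 is an assumption about how overflow should be treated.

```python
# Hypothetical helper tables for this sketch.
SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
    "(": "(", ")": ")", "\\": "\\",
}
OCTAL_DIGITS = "01234567"


def unescape_literal(body: str) -> str:
    """Decode the text between the outer parentheses of a literal string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1  # step past the backslash
        if i >= len(body):
            break  # trailing backslash with nothing after it: drop it
        nxt = body[i]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 1
        elif nxt in OCTAL_DIGITS:
            # Consume one to three octal digits.
            j = i
            while j < len(body) and j - i < 3 and body[j] in OCTAL_DIGITS:
                j += 1
            # Assumption: keep only the low 8 bits of the octal value.
            out.append(chr(int(body[i:j], 8) & 0xFF))
            i = j
        elif nxt in "\r\n":
            # Backslash followed by an end-of-line marker: skip both.
            i += 2 if body[i:i + 2] == "\r\n" else 1
        else:
            # Unrecognised escape: only the backslash is dropped. The
            # following character is handled as ordinary text next time round.
            pass
    return "".join(out)
```

For example, `unescape_literal(r"abc\@def")` drops the stray backslash and keeps the `@`, which is the behaviour I'd like the lexer to reproduce.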

Here is my simplified parser:

parser grammar PdfStreamParser;

options { tokenVocab=PdfStreamLexer; }

array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
    : NULL
    | array
    | dictionary
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    ;

content : stat* ;

stat
    : tj
    ;

tj: ((string Tj) | (array TJ)) ; // Show text

Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):

lexer grammar PdfStreamLexer;

Tj: 'Tj' ;
TJ: 'TJ' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

NAME: '/' ID ;

// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;

// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
    : '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
    ;

HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets

fragment INT: DIGIT+ ; // match 1 or more digits

fragment FLOAT:  DIGIT+ '.' DIGIT*  // match 1. 39. 3.14159 etc...
    |         '.' DIGIT+  // match .1 .14159
    ;

fragment DIGIT:   [0-9] ;        // match single digit

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.

Bart Kiers · Accepted Answer

To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:

lexer grammar DemoLexer;

STRING_OPEN
 : '(' -> pushMode(STRING_MODE)
 ;

SPACES
 : [ \t\r\n] -> skip
 ;

OTHER
 : .
 ;

mode STRING_MODE;

  STRING_CLOSE
   : ')' -> popMode
   ;

  ESCAPE
   : '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
   ;

  STRING_PART
   : ~[\\()]
   ;

  NESTED_STRING_OPEN
   : '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
   ;

  IGNORED_ESCAPE
   : '\\' . -> skip
   ;

which can be used in the parser as follows:

parser grammar DemoParser;

options {
  tokenVocab=DemoLexer;
}

parse
 : ( string | OTHER )* EOF
 ;

string
 : STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
 ;

If you now parse the string FU(abc(def)\@\))BAR, the resulting parse tree keeps the \) but omits the \@.
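
To see which tokens the demo lexer would hand to the parser for that input, here is a rough Python simulation of its two modes. It is only a sketch under stated assumptions: it imitates ANTLR's "longest match wins, earlier rule breaks ties" behaviour with regular expressions, and the names `DEFAULT_RULES`, `STRING_RULES` and `tokenize` are made up for this illustration rather than taken from the ANTLR runtime.

```python
import re

# (token type, pattern, action) in the same order as the grammar rules.
DEFAULT_RULES = [
    ("STRING_OPEN", re.compile(r"\("), "push"),
    ("SPACES", re.compile(r"[ \t\r\n]"), "skip"),
    ("OTHER", re.compile(r".", re.S), None),
]
STRING_RULES = [
    ("STRING_CLOSE", re.compile(r"\)"), "pop"),
    ("ESCAPE", re.compile(r"\\(?:[nrtbf()\\]|[0-7]{3})"), None),
    ("STRING_PART", re.compile(r"[^\\()]"), None),
    # NESTED_STRING_OPEN is retyped to STRING_OPEN in the grammar.
    ("STRING_OPEN", re.compile(r"\("), "push"),
    ("IGNORED_ESCAPE", re.compile(r"\\.", re.S), "skip"),
]


def tokenize(text):
    """Return (type, text) pairs, omitting skipped tokens."""
    tokens = []
    modes = ["DEFAULT"]  # mode stack; "STRING" stands in for STRING_MODE
    pos = 0
    while pos < len(text):
        rules = STRING_RULES if modes[-1] == "STRING" else DEFAULT_RULES
        best = None
        for name, pattern, action in rules:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m, action = best
        if action == "push":
            modes.append("STRING")
        elif action == "pop":
            modes.pop()
        if action != "skip":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize(r"FU(abc(def)\@\))BAR")
```

In this simulation the `\)` comes through as an ESCAPE token, while IGNORED_ESCAPE discards both the backslash and the `@` that follows it. If only the backslash should be dropped, that rule would need to be adjusted.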
