ANTLR4, Matching shorter character sequence with Lexical Modes

Question

A results file from an engineering program consists of many pages, each with a header line followed by some rows of data. Every header line consists of:

Character ‘1’ at the first position of the line
Some alphanumeric characters (general data not to be parsed)
String 'PAGE' at position 122 of the line
Numeric string (page number)
NL

An example of such header line is:

The software assigns the next six characters after PAGE for page numbering.

The parser works fine except for documents with more than 99999 pages, for which the software outputs strings like PAGE123456, with no space between PAGE and the page number (yes, some software generates that much data).
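To make the header layout concrete, here is a minimal plain-Java sketch (no ANTLR involved) that pulls the page number out of a header line. It is only a sanity check of the format described above, not the grammar-based solution. The class name `HeaderLineSketch`, the method `pageNumber`, and the choice to locate the last occurrence of PAGE rather than check a fixed column are assumptions made for illustration.

```java
import java.util.OptionalInt;

final class HeaderLineSketch {
    /** Width of the page-number field that follows "PAGE", per the description above. */
    static final int PAGE_FIELD_WIDTH = 6;

    /** Returns the page number of a header line, or empty if the line is not a header. */
    static OptionalInt pageNumber(String line) {
        // A header line starts with the character '1'.
        if (line.isEmpty() || line.charAt(0) != '1') {
            return OptionalInt.empty();
        }
        // Assumption: the last "PAGE" on the line is the page marker.
        int idx = line.lastIndexOf("PAGE");
        if (idx < 0) {
            return OptionalInt.empty();
        }
        int start = idx + "PAGE".length();
        int end = Math.min(line.length(), start + PAGE_FIELD_WIDTH);
        // The field may be space-padded ("  2306") or completely filled ("992306").
        String field = line.substring(start, end).trim();
        if (field.isEmpty() || !field.chars().allMatch(Character::isDigit)) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(field));
    }

    public static void main(String[] args) {
        String prefix = "1    MSC.NASTRAN JOB   MARCH  12, 2020  MSC Nastran 11/27/13   ";
        System.out.println(pageNumber(prefix + "PAGE  2306"));
        System.out.println(pageNumber(prefix + "PAGE992306"));
        System.out.println(pageNumber("     LC01 row"));
    }
}
```

Both the padded and the fully filled field yield a number here because the sketch reads a fixed-width field instead of relying on whitespace to separate PAGE from the digits.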

The first grammar I tried is:

grammar F06Reader01;
readF06: dataBlock+ EOF;
dataBlock: pageLine row+;
pageLine: ONE_AT_FIRST_POS ALPNUM* PAGEATPOS ALPNUM NL;
row: ALPNUM* NL ;
PAGEATPOS: P_ATPOS A_ATPOS G_ATPOS E_ATPOS;
P_ATPOS             :   'P'          {getCharPositionInLine() == 119}?;
A_ATPOS             :   'A'          {getCharPositionInLine() == 120}?;
G_ATPOS             :   'G'          {getCharPositionInLine() == 121}?;
E_ATPOS             :   'E'          {getCharPositionInLine() == 122}?;
ONE_AT_FIRST_POS    :   '1'          {getCharPositionInLine() == 1}?;
ALPNUM : (LETTER | DIGIT)+;
DIGIT: [0-9] ;
LETTER: ~[ \t\r\n\u0030-\u0039]; //everything but DIGITS, NL or WL
NL: '\r'? '\n';
WS : [ \t]+ ->skip;

The generated tokens classify PAGE231236 as ALPNUM, because the lexer finds that match longer than PAGE alone.
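The behaviour follows from how an ANTLR lexer chooses between rules: the rule that matches the most characters wins, and only on a tie does the rule listed first win. The toy Java sketch below imitates just that selection step with regular expressions. It is not ANTLR itself, it ignores the column predicates, and the class name `LongestMatchSketch` and method `winningRule` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    // Insertion order stands in for rule order in the grammar (used only to break ties).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("PAGEATPOS", Pattern.compile("PAGE"));
        // ALPNUM in the grammar accepts any run of non-whitespace characters.
        RULES.put("ALPNUM", Pattern.compile("\\S+"));
    }

    /** Returns "RULE(text)" for the rule that would win at position pos. */
    static String winningRule(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt()) {
                int len = m.end() - pos;
                // Strictly longer wins; on equal length the earlier rule is kept.
                if (len > bestLen) {
                    best = rule.getKey();
                    bestLen = len;
                }
            }
        }
        if (best == null) {
            return "no match";
        }
        return best + "(" + input.substring(pos, pos + bestLen) + ")";
    }

    public static void main(String[] args) {
        System.out.println(winningRule("PAGE  2306", 0)); // tie on "PAGE": first rule wins
        System.out.println(winningRule("PAGE123456", 0)); // ALPNUM matches ten characters
    }
}
```

With a space after PAGE both rules match four characters and PAGEATPOS wins the tie. Without the space, ALPNUM matches the whole run and PAGEATPOS never gets a chance.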

After finding this issue, I modified the .g4 file to add a lexical mode (PAGENUM) that should be activated when the lexer finds PAGE, but this does not happen and the lexer still produces ALPNUM tokens.

Below is the lexer file:

lexer grammar ModeTest01Lexer;
PAGEATPOS: P_ATPOS A_ATPOS G_ATPOS E_ATPOS -> mode(PAGENUM);
P_ATPOS             :   'P'          {getCharPositionInLine() == 119}?;
A_ATPOS             :   'A'          {getCharPositionInLine() == 120}?;
G_ATPOS             :   'G'          {getCharPositionInLine() == 121}?;
E_ATPOS             :   'E'          {getCharPositionInLine() == 122}?;
ONE_AT_FIRST_POS    :   '1'          {getCharPositionInLine() == 1}?;
ALPNUM : (LETTER | DIGIT)+;
DIGIT: [0-9] ;
LETTER: ~[ \t\r\n\u0030-\u0039]; //everything but DIGITS, NL or WL
NL: '\r'? '\n';
WS : [ \t]+ ->skip;

mode PAGENUM;
NUM : [0-9]+;
WS2 : [ \t]+ ->skip;
NL2: '\r'? '\n' -> mode(DEFAULT_MODE);

And the Parser:

parser grammar ModeTest01;
options { tokenVocab=ModeTest01Lexer; }
modeTest: dataBlock+ EOF;
dataBlock: pageLine row+;
pageLine: ONE_AT_FIRST_POS ALPNUM* PAGEATPOS NUM NL2;
row: ALPNUM* NL ;

This grammar still consumes PAGE123456 as ALPNUM instead of switching to PAGENUM mode after PAGE is found, as the following input shows:

1    MSC.NASTRAN JOB                                                          MARCH  12, 2020  MSC Nastran 11/27/13   PAGE992306
     LC01 row
1    MSC.NASTRAN JOB                                                          MARCH  12, 2020  MSC Nastran 11/27/13   PAGE  2306
      another row of data

Bart Kiers · Accepted Answer

You could use multiple lexer modes:

when you encounter 1 at the start of the line, you push HEADER_MODE
when in HEADER_MODE and you encounter PAGE, you push PAGE_NUMBER_MODE (every (single) other character you skip in this mode)

Something like this:

lexer grammar NastranLexer;

ONE_AT_FIRST_POS
 : {getCharPositionInLine() == 0}? '1' -> pushMode(HEADER_MODE)
 ;

NL
 : '\r'? '\n'
 ;

OTHER
 : .
 ;

mode HEADER_MODE;

  HEADER_MODE_PAGE
   : 'PAGE' -> pushMode(PAGE_NUMBER_MODE)
   ;

  HEADER_MODE_ANY
   : . -> skip
   ;

mode PAGE_NUMBER_MODE;

  PAGE_NUMBER_MODE_NUMBER
   : [0-9]+ -> mode(DEFAULT_MODE)
   ;

  PAGE_NUMBER_MODE_SPACE
   : [ \t] -> skip
   ;
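To see how the mode switches in the lexer grammar above play out on a header line, here is a hand-rolled Java sketch that imitates the three modes with a plain state variable. It is not the code ANTLR generates. The class name `ModeSketch` and the string form of the tokens are invented for illustration, and where ANTLR would report a token recognition error the sketch simply skips the character.

```java
import java.util.ArrayList;
import java.util.List;

final class ModeSketch {
    enum Mode { DEFAULT, HEADER, PAGE_NUMBER }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Stands in for the predicate getCharPositionInLine() == 0.
            boolean atLineStart = (i == 0) || input.charAt(i - 1) == '\n';
            switch (mode) {
                case DEFAULT:
                    if (c == '1' && atLineStart) {
                        tokens.add("ONE_AT_FIRST_POS");
                        mode = Mode.HEADER;
                        i++;
                    } else if (c == '\n') {
                        tokens.add("NL");
                        i++;
                    } else if (c == '\r' && i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                        tokens.add("NL");
                        i += 2;
                    } else {
                        tokens.add("OTHER(" + c + ")");
                        i++;
                    }
                    break;
                case HEADER:
                    if (input.startsWith("PAGE", i)) {
                        tokens.add("HEADER_MODE_PAGE");
                        mode = Mode.PAGE_NUMBER;
                        i += 4;
                    } else {
                        i++; // HEADER_MODE_ANY: skip every other character
                    }
                    break;
                case PAGE_NUMBER:
                    if (c >= '0' && c <= '9') {
                        int start = i;
                        while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                            i++;
                        }
                        tokens.add("PAGE_NUMBER_MODE_NUMBER(" + input.substring(start, i) + ")");
                        mode = Mode.DEFAULT;
                    } else {
                        i++; // spaces and tabs are skipped; anything else is dropped in this sketch
                    }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1  JOB  PAGE992306\nab\n"));
        System.out.println(tokenize("1  JOB  PAGE  2306\nab\n"));
    }
}
```

Because no rule in the header mode can swallow PAGE together with the digits that follow it, the fused form PAGE992306 and the padded form PAGE  2306 produce the same token sequence.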

the parser grammar could look like this:

parser grammar NastranParser;

options {
  tokenVocab=NastranLexer;
}

read
 : page* EOF
 ;

page
 : header NL row+
 ;

header
 : ONE_AT_FIRST_POS HEADER_MODE_PAGE PAGE_NUMBER_MODE_NUMBER
 ;

row
 : OTHER* NL
 ;

And when you run this:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class Main {

    public static void main(String[] args) {

        String source = "1    MSC.NASTRAN JOB                                                          MARCH  12, 2020  MSC Nastran 11/27/13   PAGE  2306\n" +
                "some data\n" +
                "1    MSC.NASTRAN JOB                                                          MARCH  12, 2020  MSC Nastran 11/27/13   PAGE  2307\n" +
                "some more data\n";

        NastranLexer lexer = new NastranLexer(CharStreams.fromString(source));
        NastranParser parser = new NastranParser(new CommonTokenStream(lexer));

        ParseTree parseTree = parser.read();
        System.out.println(parseTree.toStringTree(parser));
    }
}

the following is printed:

(read
  (page
    (header 1 PAGE 2306) \n
    (row s o m e   d a t a \n))
  (page
    (header 1 PAGE 2307) \n
    (row s o m e   m o r e   d a t a \n))
  <EOF>)

(I added some line breaks in the output above)
