Reputation: 487
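A results file from an engineering program consists of many pages, each with a header line followed by some rows of data. Every header line starts with a 1 in the first column and carries the keyword PAGE at a fixed column near the end of the line.
An example of such a header line is the line ending in PAGE 2306 in the sample input further down.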
The software assigns the next six characters after PAGE for page numbering.
The parser works fine except for documents with more than 99999 pages, for which the software outputs strings like PAGE123456 with no space between PAGE and the page number (yes, some software generates that much data).
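To make the two header variants concrete outside of ANTLR, here is a minimal plain-Java sketch that pulls the page number out of a header line. The names `PageHeader`, `PAGE_FIELD` and `pageNumber` are made up for illustration, and the regex is an assumption: it looks for PAGE followed by optional padding and up to six digits at the end of the line, and ignores the fixed column position that the real files use.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageHeader {
    // Assumed pattern: the keyword PAGE, optional padding, then 1 to 6 digits
    // at the end of the line. The fixed column position is not checked here.
    private static final Pattern PAGE_FIELD = Pattern.compile("PAGE\\s*(\\d{1,6})\\s*$");

    /** Returns the page number of a header line, or empty if the line is not a header. */
    public static OptionalInt pageNumber(String line) {
        // Header lines start with a '1' in the first column.
        if (!line.startsWith("1")) {
            return OptionalInt.empty();
        }
        Matcher m = PAGE_FIELD.matcher(line);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Padded page number (fewer than six digits).
        System.out.println(pageNumber("1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE 2306"));
        // Six-digit page number glued to the keyword.
        System.out.println(pageNumber("1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE992306"));
        // A data row is not a header.
        System.out.println(pageNumber("another row of data"));
    }
}
```

This only shows that both PAGE 2306 and PAGE992306 carry the same kind of field. It does not replace the grammar, which also has to parse the data rows.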
The first grammar I tried is:
grammar F06Reader01;
readF06: dataBlock+ EOF;
dataBlock: pageLine row+;
pageLine: ONE_AT_FIRST_POS ALPNUM* PAGEATPOS ALPNUM NL;
row: ALPNUM* NL ;
PAGEATPOS: P_ATPOS A_ATPOS G_ATPOS E_ATPOS;
P_ATPOS : 'P' {getCharPositionInLine() == 119}?;
A_ATPOS : 'A' {getCharPositionInLine() == 120}?;
G_ATPOS : 'G' {getCharPositionInLine() == 121}?;
E_ATPOS : 'E' {getCharPositionInLine() == 122}?;
ONE_AT_FIRST_POS : '1' {getCharPositionInLine() == 1}?;
ALPNUM : (LETTER | DIGIT)+;
DIGIT: [0-9] ;
LETTER: ~[ \t\n\r\u0030-\u0039]; // everything but digits, newlines or whitespace
NL: '\r'? '\n';
WS : [ \t]+ ->skip;
The lexer tokenizes PAGE231236 as a single ALPNUM token, because ALPNUM gives a longer match than PAGEATPOS.
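The longest-match behaviour can be reproduced with a toy model in plain Java. This is only a sketch of the rule "the longest match wins, and on a tie the rule listed first wins". The class `LongestMatchDemo`, its two simplified rules and the `firstToken` helper are hypothetical and much cruder than the real ANTLR lexer; in particular the ALPNUM pattern here is a simplification of the one in the grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Toy rule set, kept in declaration order like rules in a grammar file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("PAGE", Pattern.compile("PAGE"));
        RULES.put("ALPNUM", Pattern.compile("[A-Za-z0-9]+")); // simplified ALPNUM
    }

    /** Returns "RULE:text" for the rule with the longest match at the start of the input. */
    public static String firstToken(String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' lets the earlier rule win when lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("PAGE 2306"));  // PAGE and ALPNUM tie at 4 characters
        System.out.println(firstToken("PAGE123456")); // ALPNUM matches all 10 characters
    }
}
```

With a space after PAGE both rules match four characters and the first rule wins. Without the space the alphanumeric rule matches the whole string, so the PAGE rule never gets a chance.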
After finding this issue, I modified the g4 file to add a lexer mode (PAGENUM) that should be activated when the lexer finds PAGE. This does not happen, and the lexer still produces ALPNUM tokens.
Below is the lexer file:
lexer grammar ModeTest01Lexer;
PAGEATPOS: P_ATPOS A_ATPOS G_ATPOS E_ATPOS -> mode(PAGENUM);
P_ATPOS : 'P' {getCharPositionInLine() == 119}?;
A_ATPOS : 'A' {getCharPositionInLine() == 120}?;
G_ATPOS : 'G' {getCharPositionInLine() == 121}?;
E_ATPOS : 'E' {getCharPositionInLine() == 122}?;
ONE_AT_FIRST_POS : '1' {getCharPositionInLine() == 1}?;
ALPNUM : (LETTER | DIGIT)+;
DIGIT: [0-9] ;
LETTER: ~[ \t\n\r\u0030-\u0039]; // everything but digits, newlines or whitespace
NL: '\r'? '\n';
WS : [ \t]+ ->skip;
mode PAGENUM;
NUM : [0-9]+;
WS2 : [ \t]+ ->skip;
NL2: '\r'? '\n' -> mode(DEFAULT_MODE);
And the Parser:
parser grammar ModeTest01;
options { tokenVocab=ModeTest01Lexer; }
modeTest: dataBlock+ EOF;
dataBlock: pageLine row+;
pageLine: ONE_AT_FIRST_POS ALPNUM* PAGEATPOS NUM NL2;
row: ALPNUM* NL ;
This grammar still consumes PAGE123456 as ALPNUM instead of switching to PAGENUM mode after PAGE is found, as shown by the following example and its AST:
1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE992306
LC01 row
1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE 2306
another row of data
Upvotes: 1
Views: 177
Reputation: 170227
You could use multiple lexer modes:
- when you encounter a 1 at the start of the line, you push HEADER_MODE
- when you are in HEADER_MODE and you encounter PAGE, you push PAGE_NUMBER_MODE (every other single character is skipped in this mode)
Something like this:
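lexer grammar NastranLexer;
// A '1' in the first column starts a header line.
ONE_AT_FIRST_POS
 : {getCharPositionInLine() == 0}? '1' -> pushMode(HEADER_MODE)
 ;
NL
 : '\r'? '\n'
 ;
OTHER
 : .
 ;
mode HEADER_MODE;
HEADER_MODE_PAGE
 : 'PAGE' -> pushMode(PAGE_NUMBER_MODE)
 ;
// Everything else in the header is skipped, one character at a time.
HEADER_MODE_ANY
 : . -> skip
 ;
mode PAGE_NUMBER_MODE;
// The page number ends the header, so switch back to the default mode.
PAGE_NUMBER_MODE_NUMBER
 : [0-9]+ -> mode(DEFAULT_MODE)
 ;
PAGE_NUMBER_MODE_SPACE
 : [ \t] -> skip
 ;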
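To follow the mode switching without generating a lexer, here is a hand-written plain-Java approximation of the three modes. It is a sketch under assumptions, not what ANTLR generates: the class `ModeSimulation`, the `tokenize` method and the string token names are made up for illustration, only '\n' line endings are handled, and unexpected characters in the page-number mode are silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class ModeSimulation {
    private enum Mode { DEFAULT, HEADER, PAGE_NUMBER }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Tokenizes the input roughly the way the three lexer modes would. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int column = 0; // position within the current line
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int consumed = 1;
            switch (mode) {
                case DEFAULT:
                    if (c == '1' && column == 0) {
                        tokens.add("ONE_AT_FIRST_POS");
                        mode = Mode.HEADER;
                    } else if (c == '\n') {
                        tokens.add("NL");
                    } else {
                        tokens.add("OTHER");
                    }
                    break;
                case HEADER:
                    if (input.startsWith("PAGE", i)) {
                        tokens.add("HEADER_MODE_PAGE");
                        consumed = 4;
                        mode = Mode.PAGE_NUMBER;
                    }
                    // Any other character is skipped without producing a token.
                    break;
                case PAGE_NUMBER:
                    if (isAsciiDigit(c)) {
                        int end = i;
                        while (end < input.length() && isAsciiDigit(input.charAt(end))) {
                            end++;
                        }
                        tokens.add("PAGE_NUMBER_MODE_NUMBER(" + input.substring(i, end) + ")");
                        consumed = end - i;
                        mode = Mode.DEFAULT; // the number ends the header
                    }
                    // Spaces and tabs are skipped; this sketch drops anything else too.
                    break;
            }
            column = (c == '\n') ? 0 : column + consumed;
            i += consumed;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 JOB PAGE992306\nab\n"));
        System.out.println(tokenize("1 JOB PAGE  2306\nab\n"));
    }
}
```

Because the alphanumeric catch-all rule does not exist in the header mode, PAGE is recognised whether or not a space follows it, which is the point of the mode-based approach.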
The parser grammar could look like this:
parser grammar NastranParser;
options {
 tokenVocab=NastranLexer;
}
read
 : page* EOF
 ;
page
 : header NL row+
 ;
header
 : ONE_AT_FIRST_POS HEADER_MODE_PAGE PAGE_NUMBER_MODE_NUMBER
 ;
row
 : OTHER* NL
 ;
And when you run this:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;
public class Main {
    public static void main(String[] args) {
        String source = "1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE 2306\n" +
                "some data\n" +
                "1 MSC.NASTRAN JOB MARCH 12, 2020 MSC Nastran 11/27/13 PAGE 2307\n" +
                "some more data\n";
        NastranLexer lexer = new NastranLexer(CharStreams.fromString(source));
        NastranParser parser = new NastranParser(new CommonTokenStream(lexer));
        ParseTree parseTree = parser.read();
        System.out.println(parseTree.toStringTree(parser));
    }
}
The following is printed:
(read
  (page
    (header 1 PAGE 2306) \n
    (row s o m e d a t a \n))
  (page
    (header 1 PAGE 2307) \n
    (row s o m e m o r e d a t a \n)) <EOF>)
(I added some line breaks in the output above)
Upvotes: 2