SimonFreeman

Reputation: 205

ANTLR Grammar for parsing a text file

I'm going crazy trying to write a parser grammar with ANTLR. I've got a plain text file like:

Diagram :   VW  503 FSX 09/02/2015  12/02/2015  STP
Fleet   :   AAAA

OFF     :       

AAA     05+44   5R06            
KKK     05+55   06.04   1R06            5530
ZZZ     06.24   06.30   1R06            5530
YYY     07.53           REVRSE      
YYY     08.23   9G98            5070


WORKS   :       

MILES   :(LD)   1288.35 (ETY)   3.18    (TOT)   1291.53

Each "Diagram" entity is contained between "Diagram :" and the "(TOT)" entry before EOF. The same plain text file can contain multiple "Diagram" entities.

I've done some tests with ANTLR:

`grammar Hello2;

xxxt : diagram+;
diagram : DIAGRAM_ini  txt fleet LEGS+ DIAGRAM_end;
txt : TEXT;

fleet : FLEET_INI txt;
 num : NUMBER;
// Lexer Rules

DIAGRAM_ini : 'Diagram :';
DIAGRAM_end : '(TOT)' ;
LEGS : ('AAA' | 'KKK' | 'ZZZ' | 'YYY') ;
FLEET_INI :  'Fleet :';
TEXT : ('a'..'z')+ ;
NUMBER: ('0'..'9') ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;`

My goal is to parse the Diagrams recursively and gather the text and numbers of all LEGS.

Any help or tips would be much appreciated. Many thanks!

Regards, S.

Upvotes: 1

Views: 3951

Answers (1)

CoronA

Reputation: 8075

I suggest not parsing the file the way you did. This file does not define a language with words and a grammar; it is formatted text made of characters:

  • The formatting conventions are rather weak
  • The labels before the colon cannot serve as tokens, since they may reappear in the body (AAA (= label) vs. AAAA (= body))
  • The tokens must be very primitive to fit these requirements

Solution with ANTLR

You need a weaker (less specific) grammar to solve this problem, e.g.:

grammar diagrams;

diagrams : diagram+ ;

diagram : section+ ;

section : WORD ':' body? ;

body : textline+;

textline : (WORD | NUMBER | SIGNS)* ('\r' | '\n')+;

WORD : LETTER+ ;

NUMBER : DIGIT+ ;

SIGNS : SIGN+ ;

WHITESPACE : ( '\t' | ' ' )+ -> skip ;

fragment LETTER : ('a'..'z' | 'A'..'Z') ;

fragment SIGN : ('.'|'+'|'('|')'|'/') ;

fragment DIGIT : ('0'..'9') ;

Then run a visitor on the parse result:

  • to build up the normalized text of the body
  • to filter the LEGS lines out of the body
  • to parse each LEGS line with another parser (a regexp parser would be sufficient here, but you could also define another ANTLR parser)
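
The last two steps can be sketched in plain Java with java.util.regex. This is only a minimal sketch under assumptions that are mine, not part of the file format: the class name LegLineParser is hypothetical, a LEGS line is assumed to start with a three-letter upper-case code followed by at least one more field, and it is assumed to contain no colon. Note that splitting on whitespace drops empty columns, so column positions are lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: filters LEGS lines out of a body and splits them into fields.
final class LegLineParser {

    // Assumption: three upper-case letters, then one or more whitespace-separated
    // fields, none of which contains a colon (section headers such as "OFF :" do).
    private static final Pattern LEG_LINE =
            Pattern.compile("^([A-Z]{3})\\s+([^:\\s]+(?:\\s+[^:\\s]+)*)\\s*$");

    /** Returns the fields of a LEGS line, or an empty list if the line is not one. */
    static List<String> parseLine(String line) {
        Matcher m = LEG_LINE.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        List<String> fields = new ArrayList<>();
        fields.add(m.group(1));                                  // location code
        fields.addAll(Arrays.asList(m.group(2).split("\\s+")));  // remaining columns
        return fields;
    }

    /** Collects the fields of every LEGS line found in the given text. */
    static List<List<String>> parseAll(String text) {
        List<List<String>> legs = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any line break
            List<String> fields = parseLine(line);
            if (!fields.isEmpty()) {
                legs.add(fields);
            }
        }
        return legs;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "Fleet   :   AAAA",
                "OFF     :       ",
                "AAA     05+44   5R06            ",
                "KKK     05+55   06.04   1R06            5530",
                "MILES   :(LD)   1288.35 (ETY)   3.18    (TOT)   1291.53");
        for (List<String> leg : parseAll(sample)) {
            System.out.println(leg);
        }
    }
}
```

In a real visitor you would call parseLine on the normalized text of each textline instead of splitting the raw file, and you may prefer fixed column offsets over whitespace splitting if empty columns matter.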

Another alternative:

Try Packrat parsing (e.g. parboiled). It is more comprehensible, especially for people with little experience in compiler construction:

  • it matches your grammar design better
  • parboiled is pure Java (the grammar is specified in Java)

Disadvantages:

  • Whitespace handling must be done in the parser rules
  • Debugging and error messages are a problem (as with all packrat parsers)

Upvotes: 1
