Chylomicron
Chylomicron

Reputation: 251

Writing custom parser using ANTLR4, C# and VS2017

I am trying to parse files that have a format as below. What I'd like to be able to do is create a few vars and an array of structs to contain information about the file. For example there could be (pseudocode) int atomNumber = 27 and then string struct[0].element = C & float struct[0].x = 0.6877350. I would need a parser/script that allows me to get that information out of the file. I'm trying to do this with ANTLR/custom parser instead of Regex because some of these files get big; I was doing this in Unity and Regex was slow due to its effect on Unity's garbage collection. However I'm new to using ANTLR so I'm struggling to get this to work.

Here is an example of the file structure:

27
 comment 
C 0.6877350 0.0715370 -1.2710340
C 0.0387890 -0.2132770 2.4629140
C 2.9026270 0.7676750 -0.4325690
C 1.9897370 0.2682320 -1.4699130
H 3.8932221 0.3135170 -0.4057700
H 2.3979900 0.0407470 -2.4529259
H 0.0607820 -1.2602330 2.1661179
H 0.1969520 0.2155350 -0.3100010
O -0.1311780 -0.2708320 -2.3204310
C -1.0381711 -1.2940150 -2.2029400
O -1.8197130 -1.4577920 -3.0971811
C -0.9588580 -2.1451390 -0.9617850
H 0.0764330 -2.4116330 -0.7270430
H -1.5478179 -3.0458710 -1.1361990
H -1.3813020 -1.6050299 -0.1051530
C 1.0799000 0.3532050 3.0701120
H 1.9819210 -0.2100150 3.2911880
H 1.0789630 1.4006370 3.3635521
C 2.6065600 1.7304400 0.4415480
H 3.3160350 2.0450499 1.2023780
H 1.6531780 2.2537251 0.4046360
C -1.2067200 0.5084640 2.0837281
C -1.4050720 1.9360650 2.5400000
H -0.6525800 2.5874960 2.0799990
H -2.3981910 2.2678981 2.2341671
H -1.3007090 2.0248840 3.6260951
O -2.0373070 -0.0501320 1.3878731
27
 comment 
C 0.6835910 0.0801290 -1.2651600
C 0.0385760 -0.2142480 2.4595370
C 2.9039860 0.7584860 -0.4261270
C 1.9882360 0.2606780 -1.4617670
H 3.8887999 0.2924340 -0.3916030
H 2.3965650 0.0209800 -2.4418731
H 0.0600150 -1.2637050 2.1717429
H 0.1919770 0.2365210 -0.3064940
O -0.1368310 -0.2607420 -2.3141041
C -1.0378500 -1.2895941 -2.1994519
O -1.8194150 -1.4546410 -3.0934210
C -0.9520160 -2.1443820 -0.9613080
H 0.0850850 -2.4055369 -0.7287190
H -1.5361210 -3.0478990 -1.1376450
H -1.3763330 -1.6092030 -0.1024920
C 1.0823750 0.3586390 3.0560379
H 1.9862601 -0.2015870 3.2770901
H 1.0818750 1.4086040 3.3402641
C 2.6171989 1.7337860 0.4371280
H 3.3282030 2.0474880 1.1969039
H 1.6705470 2.2685010 0.3915880
C -1.2096530 0.5029830 2.0807769
C -1.4067100 1.9345620 2.5251229
H -0.6585890 2.5824180 2.0530601
H -2.4025259 2.2620981 2.2234540
H -1.2943890 2.0340610 3.6094720
O -2.0432911 -0.0622970 1.3940520
27
 comment 
C 0.6785940 0.0895900 -1.2592160
C 0.0387820 -0.2150840 2.4559050
C 2.9046619 0.7487670 -0.4192210
C 1.9859340 0.2525420 -1.4530050
H 3.8832800 0.2704900 -0.3765630
H 2.3943379 -0.0007300 -2.4296930
H 0.0600560 -1.2667160 2.1762300
H 0.1859720 0.2598610 -0.3034400
O -0.1432220 -0.2500770 -2.3077281
C -1.0374310 -1.2852740 -2.1962149
O -1.8187211 -1.4521340 -3.0900691
C -0.9445940 -2.1436851 -0.9611460
H 0.0943740 -2.3994770 -0.7311130
H -1.5238889 -3.0499580 -1.1391890
H -1.3703721 -1.6133870 -0.1000140
C 1.0848000 0.3637940 3.0426750
H 1.9905159 -0.1934890 3.2636609
H 1.0843771 1.4159710 3.3185790
C 2.6278651 1.7370080 0.4325060
H 3.3405459 2.0497701 1.1910950
H 1.6883790 2.2834129 0.3780430
C -1.2120970 0.4977380 2.0776091
C -1.4083090 1.9328350 2.5109861
H -0.6641670 2.5774081 2.0282159
H -2.4065051 2.2563839 2.2129040
H -1.2890840 2.0420361 3.5936451
O -2.0483761 -0.0737650 1.3993220

So basically it's going to have

{integer}

{comment to throw away}

{integer number of lines with 4 columns, CHAR FLOAT FLOAT FLOAT}

(repeat per number of frames but with different floats on the lines)

I've attempted to write an ANTLR4 grammar file to parse this:

grammar XYZ;
/*
 * Parser Rules
 */

file                : header comment line+ EOF;
line                : ELEMENT FLOAT FLOAT FLOAT NEWLINE;
header              : INT NEWLINE;
comment             : WORD+ NEWLINE;
/*
 * Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment NUMBER     : [0-9]+ ;
INT                 : NUMBER ;
FLOAT               : '-'? NUMBER '.' NUMBER ;
WORD                : (LOWERCASE | UPPERCASE)+ ;
ELEMENT             : 'A' .. 'Z' ;
WHITESPACE          : (' '|'\t')+ -> skip ;
NEWLINE             : ('\r'? '\n' | '\r')+ ;

I generate the scripts in CMD using the command java -jar antlr-4.7.1-complete.jar -Dlanguage=CSharp XYZ.g4

Then finally in main I have the following code snippet to run the program (input is the above text)

    AntlrInputStream istream = new AntlrInputStream(input);
    XYZLexer lexer = new XYZLexer(istream);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    XYZParser parser = new XYZParser(tokens);
    XYZParser.LineContext lineContext = parser.line();

    Console.WriteLine(lineContext.GetText());
    Console.ReadLine();

What I get is a terminal window that says line 1:0 mismatched input '27' expecting ELEMENT as well as returning the text in input.

using

    XYZParser.FileContext fileContext = parser.file();
    Console.WriteLine(fileContext.GetText());

instead gives me line 3:0 mismatched input 'C' expecting ELEMENT

How am I able to go from this to getting rid of the error and using the ANTLR runtime to get the data?

ANS:

Changing the grammar file to prevent overlap between WORD and ELEMENT

grammar XYZ;
/*
 * Parser Rules
 */

file                : frame+ EOF;
frame               : header comment line+;
line                : ELEMENT FLOAT FLOAT FLOAT NEWLINE;
header              : INT NEWLINE;
comment             : (ELEMENT | WORD+) NEWLINE;
/*
 * Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment NUMBER     : [0-9]+ ;
INT                 : NUMBER ;
FLOAT               : '-'? NUMBER '.' NUMBER ;
ELEMENT             : 'A' .. 'Z' ;
WORD                : (LOWERCASE | UPPERCASE)+ ;
WHITESPACE          : (' '|'\t')+ -> skip ;
NEWLINE             : ('\r'? '\n' | '\r')+ ;

changing the script to

AntlrInputStream istream = new AntlrInputStream(input);
XYZLexer lexer = new XYZLexer(istream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
XYZParser parser = new XYZParser(tokens);
XYZParser.FileContext fileContext = parser.line();

Console.WriteLine(fileContext.GetText());
Console.ReadLine();

for getting everything

for just seeing one value, an example is

    AntlrInputStream istream = new AntlrInputStream(input);
    XYZLexer lexer = new XYZLexer(istream);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    XYZParser parser = new XYZParser(tokens);

    XYZParser.FileContext fileContext = parser.file();
    XYZParser.FrameContext frameContext = fileContext.frame()[0];
    XYZParser.LineContext lineContext = frameContext.line()[0];

    IParseTree tree = lineContext.FLOAT()[0];
    Console.WriteLine(tree.GetText());
    Console.ReadLine();

Upvotes: 1

Views: 587

Answers (1)

sepp2k
sepp2k

Reputation: 370102

XYZParser.LineContext lineContext = parser.line();

You're trying to apply the rule line, which expects an ELEMENT at the beginning, but your input starts with the number 27, which is an INT, not an ELEMENT. You should apply the rule file instead, which does expect an INT at the beginning.

Upvotes: 2

Related Questions