Lexer/Parser design for data file

Question

I am writing a small program which needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files and I have run into a problem.

I am working in a language that doesn't have libraries for this sort of thing and I wouldn't mind the exercise so I am planning on implementing the lexer and parser by hand. I would like to implement a Lexer based roughly on this which is a fairly simple design.

The input file I need to interpret has a section which contains chemical reactions. The different chemical species on each side of the reaction are separated by '+' signs, but the names of the species can also have + characters in them (symbolizing electric charge). For example:

N2+O2=>NO+NO
N2++O2-=>NO+NO
N2+ + O2 => NO + NO

are all valid and the tokens output by the lexer should be

'N2' '+' 'O2' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'

(note that the last two are identical). I would like to avoid look ahead in the lexer for simplicity. The problem is that the lexer would start reading the any of the above inputs, but when it got to the 3rd character (the first '+'), it wouldn't have any way to know whether it was a part of the species name or if it was a separator between reactants.

To fix this I thought I would just split it off so the second and third examples above would output:

'N2' '+' '+' 'O2-' '=>' 'NO' '+' 'NO'

The parser then would simply use the context, realize that two '+' tokens in a row means the first is part of the previous species name, and would correctly handle all three of the above cases. The problem with this is that now imagine I try to lex/parse

N2 + + O2- => NO + NO

(note the space between 'N2' and the first '+'). This is invalid syntax, however the lexer I just described would output exactly the same token outputs as the second and third examples and my parser wouldn't be able to catch the error.

So possible solutions as I see it:

implement a lexer with atleast one character look ahead
include tokens for whitespace
include leading white space in the '+' token
create a "combined" token that includes both the species name and any trailing '+' without white space between, then letting the parser sort out whether the '+' is actually part of the name or not.

Since I am very new to this kind of programming I am hoping someone can comment on my proposed solutions (or suggest another). My main reservation about the first solution is I simply do not know how much more complicated implementing a lexer with look ahead is.

500 - Internal Server Error · Accepted Answer

You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.

string GetToken()
{
  string token = GetAlphaNumeric(); // assumed to ignore (eat) white-space

  var ch = GetChar(); // assumed to ignore (eat) white-space

  if (ch == '+') 
  { 
     var ch2 = GetChar(); 

     if (ch2 == '+')
       token += '+';
     else
       PutChar(ch2); 
  }

  PutChar(ch);

  return token;
}

Lexer/Parser design for data file

Answers (1)

Related Questions