Alexei Sholik

Reputation: 7471

How to separate responsibilities between parser and lexer to match a block of text?

I have a text file which I'm trying to parse. The file looks like this:

A_1: - A_10:
Some text.
----------
Some more text.
__________

B_1: - B_5:
Still more text - it may contain dashes as well.
----------
Even more text. Could be multiple sentences with newlines.
Like this.
__________

And so on.

I'm trying to separate parsing/tokenization between bison and flex. I've managed to parse the header (A_1: - A_10:) using the following regular expressions in flex:

[ \t]+ ;               // ignore whitespace
[A-Z]_[0-9]+(_[0-9]+)? { ... return ID; }

in combination with a rule in my grammar to combine the two IDs:

header:        ID ':' '-' ID ':'

However, the next segment of text is causing some trouble. I'm pretty sure I need to include start conditions in the lexer (e.g. to only ignore whitespace when parsing the header). I've tried to define a token TEXT and parse everything up to ---------- as a single token, but I still can't figure out whether this is a reasonable thing to do.

The other possibility I can think of is to have a rule in the grammar which would combine the text segment using tokens like WORD, SPACE, DASH, NEWLINE and every other possible character. Does it even make sense?

So now I'm stuck trying to parse those text segments. Am I using the right tools for the job? I would appreciate your help, thanks.

Upvotes: 0

Views: 487

Answers (2)

Chris Dodd

Reputation: 126175

This is what lex start states were designed for. Basically, you declare a start state for each different language you need to deal with (two in your case -- headers and bodies) and then tag the rules based on which state they apply to. So you'd want something like:

%s header
%s body
%%
<header>[ \t\n]                   ; /* ignore */
<header>[_a-zA-Z][_a-zA-Z0-9]*    { ... return ID; }
<header>[-:,.;()]                 { return *yytext; }

<body>^----------$                { yylval.text = GetStoredText(); return SECTION_SPLIT; }
<body>^__________$                { yylval.text = GetStoredText(); return SECTION_END; }
<body>\n                          { StoreText('\n'); }
<body>.                           { StoreText(*yytext); }
%%
void BeginHeader() { BEGIN header; }
void BeginBody() { BEGIN body; }

Where StoreText is a function that stores characters into a buffer (something like a std::stringstream if you're using C++) and GetStoredText returns all the text stored since the last call and clears the buffer. Then your yacc/bison code will look something like:

input: entry | input entry ;
entry: header body ;
header: ..something to match a header.. { BeginBody(); };
body: sections SECTION_END { BeginHeader(); };
sections: /*empty*/ | sections SECTION_SPLIT ;

Of course, you'll also want code to do whatever it is you want with the contents of the body sections...
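The answer leaves StoreText and GetStoredText unspecified. A minimal C sketch, assuming a single heap-grown buffer (the function names match the lex actions above; everything else is one possible implementation, not a prescribed one):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Buffer that accumulates body characters between section markers. */
static char *buf = NULL;
static size_t len = 0, cap = 0;

/* Append one character, growing the buffer geometrically as needed. */
void StoreText(char c) {
    if (len + 1 >= cap) {
        cap = cap ? cap * 2 : 64;
        buf = realloc(buf, cap);
    }
    buf[len++] = c;
}

/* Return everything stored since the last call and reset the buffer.
   The caller owns the returned string and should free() it. */
char *GetStoredText(void) {
    char *out = malloc(len + 1);
    if (len)
        memcpy(out, buf, len);
    out[len] = '\0';
    len = 0;
    return out;
}
```

With this, each SECTION_SPLIT/SECTION_END token carries the chunk of text that preceded the marker in `yylval.text`.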

Upvotes: 3

Alexei Sholik

Reputation: 7471

I've come to the realization that processing such a document line by line (i.e. writing your own parser suited to the task) may yield a much cleaner solution.

Another possibility is to split the whole file at each __________ marker, which yields a list of sections. Then split each section at its ---------- marker. Now we can extract the second chunk of text of each section ("Some more text." in the first section of the example above). The first chunk is simply a one-line header followed by the remaining text ("Some text.") up to the end of the chunk.

This algorithm is easily implemented in a scripting language like Perl, Python, Ruby, etc.
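For example, a compact Python sketch of that splitting algorithm (the marker strings are taken from the sample input; a body line consisting of exactly ten dashes would require line-anchored matching instead of a plain split):

```python
def split_document(text):
    """Split the raw file into sections at '__________' markers,
    then split each section into text chunks at '----------'."""
    sections = []
    for raw in text.split("__________"):
        raw = raw.strip()
        if not raw:
            continue
        chunks = [c.strip() for c in raw.split("----------")]
        # The first chunk is the one-line header plus the first text block.
        header, _, first_text = chunks[0].partition("\n")
        sections.append({
            "header": header.strip(),
            "chunks": [first_text.strip()] + chunks[1:],
        })
    return sections
```

Each returned dict then holds the header line and the list of text chunks for one section, ready for further processing.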

Upvotes: 0
