Alexei Sholik

Reputation: 7471

How to separate responsibilities between parser and lexer to match a block of text?

I have a text file which I'm trying to parse. The file looks like this:

A_1: - A_10:
Some text.
----------
Some more text.
__________

B_1: - B_5:
Still more text - it may contain dashes as well.
----------
Even more text. Could be multiple sentences with newlines.
Like this.
__________

And so on.

I'm trying to separate parsing/tokenization between bison and flex. I've managed to parse the header (A_1: - A_10:) using the following regular expressions in flex:

[ \t]+ ;               // ignore whitespace
[A-Z]_[0-9]+(_[0-9]+)? { ... return ID; }

in combination with a rule in my grammar to combine the two IDs:

header:        ID ':' '-' ID ':'

However, the next segment of text is causing some trouble. I'm pretty sure I need to include start conditions in the lexer (e.g. to only ignore whitespace when parsing the header). I've tried to define a token TEXT and parse everything up to ---------- as a single token, but I still can't figure out whether this is a reasonable thing to do.

The other possibility I can think of is to have a rule in the grammar which would combine the text segment using tokens like WORD, SPACE, DASH, NEWLINE and every other possible character. Does it even make sense?

So now I'm stuck trying to parse those text segments. Am I using the right tools for the job? I would appreciate your help, thanks.

Upvotes: 0

Views: 487

Answers (2)

Chris Dodd

Reputation: 126175

This is what lex start states were designed for. Basically, you declare a start state for each different language you need to deal with (two in your case -- headers and bodies) and then tag the rules based on which state they apply to. So you'd want something like:

%s header
%s body
%%
<header>[ \t\n]                   ; /* ignore */
<header>[_a-zA-Z][_a-zA-Z0-9]*    { ... return ID; }
<header>[-:,.;()]                 { return *yytext; }

<body>^----------$                { yylval.text = GetStoredText(); return SECTION_SPLIT; }
<body>^__________$                { yylval.text = GetStoredText(); return SECTION_END; }
<body>\n                          { StoreText('\n'); }
<body>.                           { StoreText(*yytext); }
%%
void BeginHeader() { BEGIN header; }
void BeginBody() { BEGIN body; }

Where StoreText is a function that stores characters into a buffer (something like a std::stringstream if you're using C++) and GetStoredText returns all the text stored since the last call and clears the buffer. Then your yacc/bison code will look something like:

input: entry | input entry ;
entry: header body ;
header: ..something to match a header.. { BeginBody(); };
body: sections SECTION_END { BeginHeader(); };
sections: /*empty*/ | sections SECTION_SPLIT ;

Of course, you'll also want code to do whatever it is you want with the contents of the body sections...
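The answer leaves StoreText and GetStoredText unspecified. A minimal C sketch, assuming a single heap-grown buffer (the function names match the lex actions above; everything else is one possible implementation, not a prescribed one):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Buffer that accumulates body characters between section markers. */
static char *buf = NULL;
static size_t len = 0, cap = 0;

/* Append one character, growing the buffer geometrically as needed. */
void StoreText(char c) {
    if (len + 1 >= cap) {
        cap = cap ? cap * 2 : 64;
        buf = realloc(buf, cap);
    }
    buf[len++] = c;
}

/* Return everything stored since the last call and reset the buffer.
   The caller owns the returned string and should free() it. */
char *GetStoredText(void) {
    char *out = malloc(len + 1);
    if (len)
        memcpy(out, buf, len);
    out[len] = '\0';
    len = 0;
    return out;
}
```

With this, each SECTION_SPLIT/SECTION_END token carries the chunk of text that preceded the marker in `yylval.text`.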

Upvotes: 3

Alexei Sholik

Reputation: 7471

I've come to the realization that processing such a document line by line (i.e. writing your own parser suited to the task) may yield a much cleaner solution.

Another possibility is to split the whole file at each __________ marker, which yields a list of sections. Then split each section at its ---------- marker. Now we can extract the second chunk of text of each section ("Some more text." in the first section of the example above). The first chunk is simply a one-line header followed by the remaining text ("Some text.") up to the end of the chunk.

This algorithm is easily implemented in a scripting language like Perl, Python, Ruby, etc.
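For example, a compact Python sketch of that splitting algorithm (the marker strings are taken from the sample input; a body line consisting of exactly ten dashes would require line-anchored matching instead of a plain split):

```python
def split_document(text):
    """Split the raw file into sections at '__________' markers,
    then split each section into text chunks at '----------'."""
    sections = []
    for raw in text.split("__________"):
        raw = raw.strip()
        if not raw:
            continue
        chunks = [c.strip() for c in raw.split("----------")]
        # The first chunk is the one-line header plus the first text block.
        header, _, first_text = chunks[0].partition("\n")
        sections.append({
            "header": header.strip(),
            "chunks": [first_text.strip()] + chunks[1:],
        })
    return sections
```

Each returned dict then holds the header line and the list of text chunks for one section, ready for further processing.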

Upvotes: 0
