Reputation: 7471
I have a text file which I'm trying to parse. The file looks like this:
A_1: - A_10:
Some text.
----------
Some more text.
__________
B_1: - B_5:
Still more text - it may contain dashes as well.
----------
Even more text. Could be multiple sentences with newlines.
Like this.
__________
And so on.
I'm trying to separate parsing/tokenization between bison and flex. I've managed to parse the header (A_1: - A_10:) using the following regular expressions in flex:
[ \t]+                  { /* ignore whitespace */ }
[A-Z]_[0-9]+(_[0-9]+)? { ... return ID; }
in combination with a rule in my grammar to combine the two IDs:
header: ID ':' '-' ID ':'
However, the next segment of text is causing some trouble. I'm pretty sure I need to include start conditions in the lexer (e.g. to ignore whitespace only while parsing the header). I've tried to define a token TEXT and parse everything up to the ---------- marker as a single token, but I still can't figure out whether that is a reasonable thing to do.
The other possibility I can think of is a rule in the grammar which would assemble the text segment from tokens like WORD, SPACE, DASH, NEWLINE, and every other possible character. Does that even make sense?
So now I'm stuck trying to parse those text segments. Am I using the right tools for the job? I'd appreciate your help, thanks.
Upvotes: 0
Views: 487
Reputation: 126175
This is what lex start states were designed for. Basically, you declare a start state for each different language you need to deal with (two in your case -- headers and bodies) and then tag the rules based on which state they apply to. So you'd want something like:
%s header
%s body
%%
<header>[ \t\n]         ; /* ignore whitespace */
<header>[_a-zA-Z][_a-zA-Z0-9]*  { ... return ID; }
<header>[-:,.;()]       { return *yytext; }
<body>^----------$      { yylval.text = GetStoredText(); return SECTION_SPLIT; }
<body>^__________$      { yylval.text = GetStoredText(); return SECTION_END; }
<body>\n                { StoreText(*yytext); /* keep newlines in the body text */ }
<body>.                 { StoreText(*yytext); }
%%
void BeginHeader() { BEGIN header; }
void BeginBody() { BEGIN body; }
Where StoreText is a function that stores characters into a buffer (something like a std::stringstream if you're using C++) and GetStoredText returns all the text stored since the last call and clears the buffer. Then your yacc/bison code will look something like:
input: entry | input entry ;
entry: header body ;
header: ..something to match a header.. { BeginBody(); };
body: sections SECTION_END { BeginHeader(); };
sections: /*empty*/ | sections SECTION_SPLIT ;
Of course, you'll also want code to do whatever it is you want with the contents of the body sections...
Upvotes: 3
Reputation: 7471
I've come to the realization that processing such a document line by line (i.e. writing your own parser suited to the task) may yield a much cleaner solution.
Another possibility is to split the whole file at each __________ marker, which yields a number of sections. Then split each section at the ---------- markers. Now we can extract the second chunk of text of the current section ("Some more text." in the first section of the example above). The first chunk is simply the one-line header followed by the text ("Some text.") up to the end of the chunk.
This algorithm is easily implemented in a scripting language like Perl, Python, Ruby, etc.
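The same line-by-line idea can also be sketched in C, for consistency with the rest of this post (a scripting language would of course be shorter). This is a hypothetical helper, not from the original post: it walks the document treating "__________" as a section terminator and "----------" as a chunk separator, and records how many chunks each section contains:

```c
#include <string.h>

/* Count the sections in `doc` and the chunks inside each one.
   Note: modifies `doc` in place via strtok. Returns the number of
   sections seen; chunk_counts[i] holds the chunk count of section i. */
int split_sections(char *doc, int chunk_counts[], int max_sections) {
    int nsec = 0;
    int nchunks = 1;                        /* a section starts with one chunk */
    for (char *line = strtok(doc, "\n"); line; line = strtok(NULL, "\n")) {
        if (strcmp(line, "__________") == 0) {
            if (nsec < max_sections)
                chunk_counts[nsec] = nchunks;
            nsec++;
            nchunks = 1;                    /* reset for the next section */
        } else if (strcmp(line, "----------") == 0) {
            nchunks++;                      /* next chunk of this section */
        }
        /* any other line is header or body text of the current chunk */
    }
    return nsec;
}
```

Running it on the example document from the question yields two sections with two chunks each.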
Upvotes: 0