Reputation: 209
I want to create a grammar and lexer to parse the following string:
100 reason phrase
The regular expression would be: "\d{3} [^\r\n]*"
Token definition:
template <typename Lexer>
struct custom_tokens : lex::lexer<Lexer>
{
    custom_tokens()
    {
        this->self.add_pattern
            ("STATUSCODE", "\\d{3}")
            ("SP", " ")
            ("REASONPHRASE", "[^\r\n]*")
        ;
        this->self.add
            ("{STATUSCODE}", T_STATUSCODE)
            ("{SP}", T_SP)
            ("{REASONPHRASE}", T_REASONPHRASE)
        ;
    }
};
Grammar:
template <typename Iterator>
struct custom_grammar : qi::grammar<Iterator>
{
    template <typename TokenDef>
    custom_grammar(TokenDef const& tok)
        : custom_grammar::base_type(start)
    {
        start = (qi::token(T_STATUSCODE) >> qi::token(T_SP) >> qi::token(T_REASONPHRASE));
    }
    qi::rule<Iterator> start;
};
However, I realized that I can't define the token T_REASONPHRASE, because its pattern matches everything, including the input meant for T_STATUSCODE. Should I drop T_REASONPHRASE and write a rule with qi::lexeme inside custom_grammar instead?
Or can I use a lexer state for this? For example, define T_REASONPHRASE in a second state and, once T_STATUSCODE has been seen in the first state, switch to the second state to parse the rest. Could you please give an example?
Upvotes: 1
Views: 180
Reputation: 393789
I don't think there really is a problem, because tokens are 'greedily' matched in the order they've been added to the token definitions (for a specific lexer state).
So, given
    this->self.add
        ("{STATUSCODE}", T_STATUSCODE)
        ("{SP}", T_SP)
        ("{REASONPHRASE}", T_REASONPHRASE)
    ;
T_STATUSCODE will always match before T_REASONPHRASE (if there is an ambiguity at all).
About using separate Lexer states, here's an excerpt of a tokenizer I once had in a toy project:
    this->self = fileheader [ lex::_state = "GT" ];
    this->self("GT") =
        gametype_label |
        gametype_63000 | gametype_63001 | gametype_63002 |
        gametype_63003 | gametype_63004 | gametype_63005 |
        gametype_63006 |
        gametype_eol [ lex::_state = "ML" ];
    this->self("ML") = mvnumber [ lex::_state = "MV" ];
    this->self("MV") = piece | field | op | check | CASTLEK | CASTLEQ
        | promotion
        | Checkmate | Stalemate | EnPassant
        | eol [ lex::_state = "ML" ]
        | space [ lex::_pass = lex::pass_flags::pass_ignore ];
(The purpose would be relatively clear if you read GT as gametype, ML as move line and MV as move. Note the presence of both eol and gametype_eol here: Lex disallows adding the same token to different states.)
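To illustrate the state-switching idea for the status line in the question, here is a minimal sketch that uses only the standard library (std::regex) rather than Spirit.Lex, so it shows the technique and not the Lex API. The names State, Token and tokenize_status_line are hypothetical and chosen for this example. It assumes the switch to the second state happens after the separator, so that the reason-phrase pattern never competes with the status code or the space.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Token ids named after the ones in the question.
enum TokenId { T_STATUSCODE, T_SP, T_REASONPHRASE };

struct Token {
    TokenId id;
    std::string text;
};

// Hypothetical lexer states: Initial knows the status code and the
// separator, Reason knows only the reason phrase.
enum class State { Initial, Reason };

// Tokenizes one status line such as "100 reason phrase".
// Returns an empty vector if the Initial state cannot match the input.
std::vector<Token> tokenize_status_line(const std::string& input) {
    static const std::regex status_code("\\d{3}");
    static const std::regex space(" ");
    static const std::regex reason("[^\\r\\n]*");
    // match_continuous anchors every match at the current position.
    const auto flags = std::regex_constants::match_continuous;

    std::vector<Token> tokens;
    State state = State::Initial;
    auto pos = input.cbegin();
    const auto end = input.cend();
    std::smatch m;

    while (pos != end) {
        if (state == State::Reason) {
            // Only the reason-phrase pattern is active in this state.
            if (std::regex_search(pos, end, m, reason, flags) && m.length() > 0) {
                tokens.push_back({T_REASONPHRASE, m.str()});
            }
            break;  // the rest of the line (if any) is not tokenized here
        }
        if (std::regex_search(pos, end, m, status_code, flags)) {
            tokens.push_back({T_STATUSCODE, m.str()});
        } else if (std::regex_search(pos, end, m, space, flags)) {
            tokens.push_back({T_SP, m.str()});
            state = State::Reason;  // switch state after the separator
        } else {
            return {};  // nothing in the Initial state matches
        }
        assert(m.length() > 0);  // Initial-state patterns always consume input
        pos += m.length();
    }
    return tokens;
}
```

Calling tokenize_status_line("100 reason phrase") yields three tokens: T_STATUSCODE with "100", T_SP with " ", and T_REASONPHRASE with "reason phrase". In Spirit.Lex the same transition would be expressed with a lex::_state assignment in a semantic action, as in the excerpt above.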
Upvotes: 2