Baradé

Reputation: 1332

Boost Spirit Qi parser does not consume whole string expression?

Assuming I have the following rule:

identifier %= 
        lexeme[
            char_("a-zA-Z")
            >> -(*char_("a-zA-Z_0-9")
            >> char_("a-zA-Z0-9"))
        ]
        ;

qi::rule<Iterator, std::string(), Skipper> identifier;

and the following input:

// identifier
This_is_a_valid123_Identifier

As the trace shows, the identifier is parsed and the attribute is set correctly, but the skipper then resumes at the second character of the input, as if only one character had been consumed:

<identifier>
  <try>This_is_a_valid123_I</try>
  <skip>
    <try>This_is_a_valid123_I</try>
    <emptylines>
      <try>This_is_a_valid123_I</try>
      <fail/>
    </emptylines>
    <comment>
      <try>This_is_a_valid123_I</try>
      <fail/>
    </comment>
    <fail/>
  </skip>
  <success>his_is_a_valid123_Id</success>
  <attributes>[[T, h, i, s, _, i, s, _, a, _, v, a, l, i, d, 1, 2, 3, _, I, d, e, n, t, i, f, i, e, r]]</attributes>
</identifier>
<skip>
  <try>his_is_a_valid123_Id</try>
  <emptylines>
    <try>his_is_a_valid123_Id</try>
    <fail/>
  </emptylines>
  <comment>
    <try>his_is_a_valid123_Id</try>
    <fail/>
  </comment>
  <fail/>
</skip>

I have already tried using as_string inside the lexeme expression, but that did not help.

Upvotes: 2

Views: 633

Answers (1)

sehe

Reputation: 393064

I don't see why you complicate the expression. Can you try:

identifier %=
           char_("a-zA-Z")
        >> *char_("a-zA-Z_0-9")
        ;

qi::rule<Iterator, std::string()> identifier;

This is about the most standard expression you can get. Even if you don't want to allow identifiers ending in _, I'm quite sure you don't want such a trailing _ to be parsed as 'the next token'. In that case, I'd just add validation after the parse.

Update, in response to the comment:

Here is the analysis:

  • First off: -(*x) is a red flag. It is never a useful pattern: *x already matches an empty sequence, so you cannot make it "more optional".

    (In fact, if *x allowed partial backtracking as in regular expressions, you would likely have seen exponential performance or even infinite runtime; "luckily", *x is always greedy in Spirit Qi.)

This indeed facilitates your bug. Let's refer to the three lines of the parser expression inside lexeme[...] in the question as lines 1, 2 and 3.

  • First, Line 1 matches T.
  • The second line initially greedily matches his_is_a_valid123_Identifier.
  • But that cannot satisfy the third line, so the -(...) kicks in and everything after line 1 is backtracked.
  • However: Qi

    • does backtrack the cursor (current input iterator) but
    • does not by default rollback changes to container attributes.

    Yes. You guessed it. std::string is a container attribute.

So what you end up with is a successful match of length 1, plus the residue of a failed optional sequence in the attribute.

Some other backgrounders on how to resolve this kind of backtracking issue:

Upvotes: 4
