Roger Costello
Roger Costello

Reputation: 3209

Using Flex to treat XML Schema xs:annotation elements as comments

I am experimenting with Flex to tokenize an XML Schema file. I would like to treat the <xs:annotation> element as a comment. Here is an example of an <xs:annotation> element in an XML Schema:

<xs:annotation>
    <xs:documentation>This is a comment for humans</xs:documentation>
    <xs:appinfo>This is a comment for machines</xs:appinfo>
</xs:annotation>

I am following the example on page 38 of the Flex&Bison book and am using a COMMENT state. Here is the approach that I am taking: begin a comment upon encountering <xs:annotation>

"<xs:annotation>"    { BEGIN(COMMENT) ; }

Switch state upon encountering the end tag </xs:annotation>

<COMMENT>"</xs:annotation>"  { BEGIN(INITIAL); }

The comment that lies between the xs:annotation start tag and end tag is any character except <, or < followed by any character except /, or </ followed by any character except x, or </x followed by any character except s, or </xs followed by any character except :, or </xs: followed by any character except a, or </xs:a followed by any character except n

<COMMENT>([^<]|<[^/]|</[^x]|</x[^s]|</xs[^:]|</xs:[^a]|</xs:a[^n])+

Unfortunately, Flex gives this error message:

unrecognized rule

What am I doing wrong, please?

Upvotes: 0

Views: 126

Answers (1)

rici
rici

Reputation: 241741

/ is a regex operator in (f)lex ("trailing context"). To use it as a literal character, it must be quoted, escaped, or included in a character class. So any of these can be used to represent the fixed string </xs:

<"/"xs:
"</xs:"
<\/xs:
<[/]xs:

If you're new to (f)lex, you might not be familiar with features like the trailing context operator or quoted strings, which are not implemented in most regex libraries. You might want to take a few minutes to read the flex manual section on pattern syntax. It's not long.


The simplest template to recognise a text between opening and closing delimiters is the following:

%x COMMENT
  // ...
%%
"<xs:annotation>"            { BEGIN(COMMENT) ; }
<COMMENT>"</xs:annotation>"  { BEGIN(INITIAL); }
<COMMENT>.|\n                ;
<COMMENT><<EOF>>             { fputs("Unterminated comment\n", stderr): }

(All but the second last line are identical to what you already have.)

It's worth noting how this works. The pattern .|\n matches absolutely any single character. But it won't interfere with the detection of the closing delimiter because of the so-called "maximal munch" rule, which dictates that the lexical scanner always accepts the longest possible match at each point. That's the rule that (f)lex-generated scanners use to decide which of two matching patterns applies. (If more than one rule produce the same longest match, the scanner choses the one which appears first in the file.)

So the .|\n pattern won't match when the input starts with </xs:annotation>, because there is (much) longer match which applies.

You could just stop there. But, as John Levine points out in Flex and Bison, this is not the most efficient solution, because every individual character is handled as a single token. Even though there is no action associated with the token, it still incurs all the overhead associated with a token match. So it makes sense to add an additional rule which matches a longer sequence; however, this longer pattern must not interfere with the recognition of the closing delimiter.

For example, we could add the rule:

<COMMENT>[^<]+                ;

which will match any sequence of characters not including a <, thereby moving over the text with many fewer tokens. (In Flex and Bison, the equivalent rule is written ([^*]|\n)+ but the explicit newline match is unnecessary, since in (f)lex negated character classes do match a newline, unless the newline is mentioned in the set.)

But note that we still need a rule which will match a single character, because the above won't match <. Since the two rules have the same action (which is to do nothing), we could combine them:

<COMMENT>[^<]+|.              ;

That's almost certainly where you should stop reading :-) But you seem to have been seeking to extend this optimisation to match other longer sequences starting with < which don't match the closing delimiter, so I'll note that this optimisation can be extended but this needs to be done with some care. For example, if we were to write:

<COMMENT>([^<]+|<[^/])|.      ;

we would discover (soon, if we wrote sufficient unit tests :-) ) that the scanner fails to recognise

<xs:annotation>This is not valid XML: <</xs:annotation>

That's probably not a huge issue for your project, but it needs to be taken into account when trying to hand-write negative regular expressions. The correct extension would be:

<COMMENT>([^<]+|<[^/<])|.     ;

In reality, the overhead incurred by an extra token match is really minor, and it's not worth going to heroic efforts to avoid it. The simple version is almost certainly good enough for all practical purposes. However, it does impose a different lexical overhead, because it forces the scanner to fall back to a previous position every time it encounters a close tag other than the closing delimiter. The problem with the fallback is not so much the cost of falling back -- here, as in most cases, it's infrequent -- so much as that the existence of a fallback anywhere in the lexical rules prevents flex from applying an optimisation to the lexer. (In a scanner without fallback, it's not necessary to check at every input position whether the rule could match at that point in order to save fallback information.)

To eliminate fallback, we could add a rule which matches any prefix of the closing delimiter. (Since < does not appear in the closing delimiter after the start, we don't have to worry about possible overlaps, as above. This won't be the case for all possible closing delimiters.)

<COMMENT><("/"(x(s(:(a(n(n(o(t(a(t(i(on?)?)?)?)?)?)?)?)?)?)?)?)?)?   ;

But note that it's only worth doing that if there are no fallbacks elsewhere in the lexical grammar. So if you don't feel like meticulously counting parentheses, don't worry about it.

Upvotes: 1

Related Questions