Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

Question

I am having trouble getting flex to scan lines that look something like this:

DESCRIPTION                    This is the device description

I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.

I have been playing endlessly with my rules but cannot seem to get it to work.

From the documentation I think I want to implement a rule using

`r/s' an r but only if it is followed by an s

where a space is only accepted if it is followed by something that is not whitespace. I have no idea how to write this rule in flex's syntax. In my mind the rule should be something like

[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])*        return IDENTIFIER;

But this is invalid.

I can get the scanner to chop the line up into individual words, but I cannot get the rules to differentiate between a single space and a run of more than one space. Halp.

rici · Accepted Answer

This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions, but excessive use of start conditions is often an indication that some other scanning mechanism would be better.

Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:

DEVICE      This is the device
MODE        This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD

Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
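To make the "how do you decide" question concrete, here is a minimal sketch in plain C of one possible division rule: the first run of two or more spaces is the separator. Both the rule and the helper name split_field_line are assumptions for illustration only; nothing in the question says this is the right rule for the real input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of flex): split `line` at the first run of
 * two or more spaces.
 * On success, stores the length of the text before the padding in *key_len,
 * points *desc at the first character after the padding, and returns 1.
 * Returns 0 if the line contains no run of two or more spaces. */
int split_field_line(const char *line, size_t *key_len, const char **desc)
{
    const char *gap = strstr(line, "  ");   /* first two consecutive spaces */
    if (gap == NULL)
        return 0;                           /* no padding: cannot divide */
    *key_len = (size_t)(gap - line);
    *desc = gap + strspn(gap, " ");         /* skip the whole padding run */
    return 1;
}
```

Under this rule the first two sample lines split as expected, but the third and fourth lines contain no padding at all, so the function returns 0 for both. A "first word is the keyword" rule would instead split the fourth line into UNDOCUMENTED and FIELD. Which of those behaviours is correct depends entirely on the input format.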

If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you have read the flex manual's chapter on start conditions):

%x WHITE WORDS
%%
  /* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+   { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
  /* Handle other possible line beginnings */

<WHITE>\n      { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+  { BEGIN(WORDS); }
<WHITE>.       { /* Something not correct in this line */; ... }
<WORDS>.+      { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }

<WORDS>\n      { BEGIN(INITIAL); }

If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:

[[:alpha:]]+( [[:alpha:]]+)*

which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the . rule).
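To see what that pattern accepts, here is a hand-written C equivalent, offered as a rough sketch rather than anything flex generates. The function name match_spaced_words is hypothetical, and isalpha in the default C locale is assumed to agree with [[:alpha:]].

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the longest prefix of `s`
 * that matches [[:alpha:]]+( [[:alpha:]]+)*, or 0 if `s` does not start
 * with a letter. */
size_t match_spaced_words(const char *s)
{
    size_t i = 0;

    /* First word: one or more letters. */
    while (isalpha((unsigned char)s[i]))
        i++;
    if (i == 0)
        return 0;

    /* Each further word must be preceded by exactly one space. */
    for (;;) {
        size_t j = i;
        if (s[j] != ' ')
            break;                          /* not a space: match ends */
        j++;
        if (!isalpha((unsigned char)s[j]))
            break;                          /* second space or non-letter */
        while (isalpha((unsigned char)s[j]))
            j++;
        i = j;                              /* accept the space and the word */
    }
    return i;
}
```

For "DEVICE      This is the device" the match stops after DEVICE, because the first space is followed by another space. For "UNDOCUMENTED FIELD" both words are matched. For a line padded with only one space, such as "DESCRIPTION This is the device description", the match runs to the end of the line, so this pattern only helps when the padding is always at least two spaces wide.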
