Linda

Reputation: 251

Regular expression in flex does not work

I am new to flex and I want to design a scanner with it.

At this step, I want to write a regular expression that matches an id, subject to these conditions:

  1. an underscore can appear in an id

  2. you can use _ wherever you want, but at most 2 underscores may appear consecutively, for example:

    a_b_c »»»» true

    a___b »»»» false

    123abv »»»» false

  3. digits can't be at the beginning of an id

  4. an underscore can't appear at the end of an id

The regular expression I have written for that is :

(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)

but now I have 2 questions:

  1. Is the regular expression correct? I have tested it on rubular.com and I think it is, but I'm not sure.

  2. The other, more important problem is that when I write this in my flex file, no id is recognized, and I can't see why.

Can anyone please help me?

Upvotes: 2

Views: 1114

Answers (2)

brenns10

Reputation: 3379

The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for matching word boundaries. Other than that, your regular expression is sound. I was able to get your code working using this modified version of yours: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's, and some of the parentheses that bothered me).

Unfortunately, this causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12, assume that it found an IC, and then read _. Finding no rule that matches, it will echo that character to stdout (Flex's default action), then read 345 as another IC.

In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:

  • Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
  • Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+). If it is matched, give an error. This will cause Flex to return the entire token 12_345 as an error in the above example.
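The difference between scanning with and without the second kind of catch-all rule can be sketched with a small Python emulation of how Flex picks a rule (longest match wins, ties go to the earlier rule, unmatched characters are echoed). This is only an approximation written with Python's re module, not Flex itself; the IC pattern [0-9]+ and the token name ERROR are assumptions made for illustration.

```python
import re

# Patterns written in Python syntax. ID is the modified regex from this answer;
# IC ([0-9]+) and the ERROR rule name are assumptions for illustration.
ID = r"_{0,2}[A-Za-z][0-9A-Za-z]*(?:_{0,2}[0-9A-Za-z]+)*"
IC = r"[0-9]+"
BAD_WORD = r"[_0-9A-Za-z]+"


def tokenize(text, rules):
    """Emulate Flex rule selection on `text`.

    The longest match wins; on a tie the earlier rule wins. A character
    that no rule matches is reported as ECHO, like Flex's default rule.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Strictly longer only, so earlier rules win ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            best_name, best_len = "ECHO", 1
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


without_fallback = [("ID", ID), ("IC", IC)]
with_fallback = [("ID", ID), ("IC", IC), ("ERROR", BAD_WORD)]

print(tokenize("12_345", without_fallback))
print(tokenize("12_345", with_fallback))
print(tokenize("a_b_c", with_fallback))
```

Because the catch-all rule comes last, it only wins when it matches strictly more text than ID or IC, so valid identifiers and integers are still reported under their own rules.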

One other problem: The ID regular expression still won't match anything with underscores at the end of it, so for input like abc_ Flex will match abc and then hit a stray _. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use the \b symbol. The Patterns section of the Flex manual is a reference on Flex's regular expression syntax, so you can find other things to use/avoid.

Upvotes: 1

rici

Reputation: 241931

I think your requirement is:

  1. Identifiers can use only alphanumeric characters and _

  2. Identifiers cannot start with a number

  3. Identifiers cannot end with an _

  4. Identifiers cannot include more than two consecutive _

(When I first read your question, I thought the last requirement was that identifiers cannot include more than two _, but looking at the proposed regex, I think the version above is more accurate.)

Based on the above, you should be able to use the following two Flex patterns:

([[:alpha:]]|__?[[:alnum:]])(_?_?[[:alnum:]])*  { /* Handle an identifier */ }
[[:alpha:]_][[:alnum:]_]* { /* Error */ }

Breaking that down:

  • ([[:alpha:]]|__?[[:alnum:]]) matches an alphabetic character or one or two _ followed by an alphanumeric character.

  • (_?_?[[:alnum:]])* matches a string of _ and alphanumeric characters, with a maximum of two _ before each alphanumeric character.

The second pattern will match anything which starts with an alphabetic character or _, followed by any number of alphanumerics or _. This will match all valid identifiers as well as the sequences which contain too many consecutive _ or which end with _. If both patterns match the same text (that is, a valid identifier), the first one will win, so it will be correctly recognized. The second pattern will consume the entire erroneous identifier, allowing for easier error recovery.
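As a quick sanity check of the two patterns, here is a minimal sketch in Python. Python's re module has no POSIX bracket classes, so [[:alpha:]] and [[:alnum:]] are written out as ASCII ranges, which is an assumption about the intended character set. The helper classify is hypothetical and tests whole words with fullmatch; it does not emulate Flex's scanning loop.

```python
import re

# The two Flex patterns, translated to Python syntax.
# [[:alpha:]] -> [A-Za-z], [[:alnum:]] -> [0-9A-Za-z] (ASCII assumed).
VALID = re.compile(r"(?:[A-Za-z]|__?[0-9A-Za-z])(?:_?_?[0-9A-Za-z])*")
CANDIDATE = re.compile(r"[A-Za-z_][0-9A-Za-z_]*")


def classify(word):
    """Classify a whole word the way the two rules would, in rule order."""
    if VALID.fullmatch(word):
        return "identifier"
    if CANDIDATE.fullmatch(word):
        return "error"
    return "no match"  # neither pattern covers the whole word


for word in ["a_b_c", "a__b", "__9x", "a___b", "abc_", "123abv"]:
    print(word, classify(word))
```

Note that 123abv is covered by neither pattern as a whole word; in a real scanner it would be up to the other rules (for example an integer rule) to decide how that input is split.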

The pattern in the OP doesn't work because flex treats \b as a backspace character (as in C). Flex does not implement word boundary assertions, but in a lexer you almost never need these; the pattern above can be used if necessary.
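The effect of \b being read as a backspace can be imitated in Python by placing a literal backspace character ("\x08") around the identifier pattern. This is only an illustration of the idea using Python's re module, not Flex itself, and the variable names are made up for the example.

```python
import re

# Body of the OP's pattern without the \b markers (Python syntax).
BODY = r"_{0,2}[A-Za-z][0-9A-Za-z]*(?:_{0,2}[0-9A-Za-z]+)*"

# "\x08" is the backspace character, which is how Flex reads \b.
as_flex_reads_it = re.compile("\x08" + BODY + "\x08")
without_boundaries = re.compile(BODY)

# An ordinary identifier has no backspace characters around it,
# so the first pattern cannot match it.
print(as_flex_reads_it.fullmatch("a_b_c") is not None)
print(as_flex_reads_it.fullmatch("\x08a_b_c\x08") is not None)
print(without_boundaries.fullmatch("a_b_c") is not None)
```

Only input that actually contains backspace characters on both sides of the identifier satisfies the first pattern, which is consistent with no id being recognized in the OP's scanner.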

Upvotes: 0
