Reputation: 10324
I am writing a parser for a scripting language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have a problem recognizing integers and floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an integer ("123")?
Upvotes: 2
Views: 5244
Reputation: 241791
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
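To see the three notations side by side, here is a minimal flex sketch. The patterns and messages are made up purely for illustration and have nothing to do with your scanner; the catch-all last rule is only there so that unmatched input is silently dropped.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
[abc]        { /* bracket expression: exactly one character from the set */
               printf("set: %s\n", yytext); }
(foo|bar)+   { /* parentheses group a regular expression, here an alternation */
               printf("group: %s\n", yytext); }
"a|b"        { /* quotes: the three literal characters a, |, b */
               printf("quoted: %s\n", yytext); }
.|\n         { /* ignore everything else */ }
%%
int main(void) { yylex(); return 0; }
```

Built with something like `flex demo.l && cc lex.yy.c` (the file name is arbitrary), the input `a` should trigger the first rule, `foobar` the second, and `a|b` the third, since the quoted rule gives the longest match there.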
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", then +, then " again (which has no effect, because a set contains zero or one instances of each member), then |, and finally the range "-", whose endpoints are both the character " and which therefore includes only that character, which is already in the set. In short, it is equivalent to ["+|]. It will match one of those three characters. In fact, it requires one of those three characters.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], written [[:digit:]] in a pattern, which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [1-9]
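If you want to check this yourself, a throwaway scanner along the following lines should do. It is only a sketch: it uses your integer rule unchanged, assumes the DIGIT definition mentioned above, and prints the match instead of returning INTEGER so that no parser is needed.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
DIGIT [0-9]
%%
["+"|"-"][1-9]{DIGIT}*   { /* the rule from the question, copied as-is */
                           printf("INTEGER: %s\n", yytext); }
.|\n                     { /* skip anything the rule above does not match */ }
%%
int main(void) { yylex(); return 0; }
```

Fed the four inputs above, one per line, only `|42` should produce a line of output.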
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 . | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional, e.g. "foo"? matches either the three characters f, o, o (in order) or the empty string. You can use that to make the sign optional.
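As a small illustration of the ? operator on its own, here is a sketch in which an optional "foo" is followed by a mandatory "bar". Both strings are arbitrary placeholders, chosen only to continue the "foo" example; this is not one of your numeric rules.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"foo"?"bar"   { /* ? applies to the whole quoted string "foo" */
                printf("matched %s\n", yytext); }
.|\n          { /* ignore everything else */ }
%%
int main(void) { yylex(); return 0; }
```

Both `foobar` and `bar` should be reported as matches, while `foo` by itself falls through to the catch-all rule.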
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
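The two matching criteria can be watched in action with a sketch like the one below. The number rule here is a deliberately naive placeholder ([0-9]+, with no sign and no restriction on leading zeros), not a fix for your integer rule; the string rule is the one from your question, and both print instead of returning a token.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
[0-9]+          { /* placeholder number rule, listed first so it wins ties */
                  printf("NUMBER %s\n", yytext); }
[a-zA-Z0-9_]+   { /* string rule from the question */
                  printf("STRING %s\n", yytext); }
.|\n            { /* skip separators */ }
%%
int main(void) { yylex(); return 0; }
```

With the input `abc123 123 12ab`, this should report `abc123` as a STRING (only the string rule can start at a, and it runs as far as it can), `123` as a NUMBER (both rules match the same three characters, so the earlier rule wins), and `12ab` as a STRING (the string rule gives the longer match). Swapping the order of the first two rules would turn `123` into a STRING as well.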
I hope that gives you some ideas about how to fix things.
Upvotes: 4