Reputation: 59
let's say I'm building a compiler and I want the lexical analyzer to recognize integers of the C language, can I specify for example that the integer should be between –2,147,483,648 and 2,147,483,647 that a long integer can be 64 bits? I feel that my question is stupid but I want to know if it's doable... thanks
Upvotes: 2
Views: 421
Reputation: 4897
This regex should work:
-\b(?:214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b|\b(?:214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b
Upvotes: 1
Reputation: 24802
Yes that can be done, but you should not do that!
Spoiler alert: you should better be using strtol
, and I'm telling you why in the long answer.
it can be done using a weirdly crafted regexp (the worst one being a regexp with the list of all integers between MIN and MAX), but you do not want to do such a thing.
This is because such a task would mean a massive processing for the regexp, whereas that test can be done in a very little processing in your favorite language (consider the following as pseudocode):
if (str_to_int(s) > CMIN && str_to_int(s) < CMAX)
Well, actually you might tell me "but if it's an int, it will overflow!". But there are technics to detect that:
and none of them is using a regex!
But anyway, you don't need to go into so much trouble, when there's a function already baked in the C standard library that does that job for you: the strtol
function! Quoting the manual:
The strtol() function returns the result of the conversion, unless the value would underflow or overflow. If an underflow occurs, strtol() returns LONG_MIN. If an overflow occurs, strtol() returns LONG_MAX. In both cases, errno is set to ERANGE. Precisely the same holds for strtoll() (with LLONG_MIN and LLONG_MAX instead of LONG_MIN and LONG_MAX).
Why would it be massive? It's because a regexp is an automaton looking at a stream of characters. When there's a match, you move along the automaton. Basically, you'd need to:
-
2
, can only be followed by 0
or 1
,2
, followed by a 1
, can only be followed by 0
, 1
, 2
, 3
or 4
2
, followed by a 1
and then a 4
, can only be followed by a 1
, 2
, 3
, 4
… 7
2
, folowed by … and ends with a 7
, but if it started with a -
, and then a 2
, it needs to end with a 6
(so basically you have to copy all the previous conditions into another subgraph that ends with that)That would look a bit like the following:
^(
(
\d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
[0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-8]
)|
-(
\d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
[0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-7]
)
)$
which is represented visually by the following automaton (click on the image to play with it):
I'm not sure how correct that would be, because I might have missed edge cases, but I hope I made it clear how it compares with doing it in your favorite language. If you actually parse such a huge automaton, it will:
all that instead of doing something that can be done in an operation being 1/100th of the complexity of doing the same thing using a regexp.
So if you don't want to kill a baby seal because of bad programming, don't use a regexp for something it hasn't been designed for.
To better understand what that is an automaton, how regexps are working, when is it a good idea to use and when it's a baby seal killing one, I can only advice you to look at the following courses:
strtol
: Does strtol("-2147483648", 0, 0) overflow if LONG_MAX is 2147483647?Here's the visualization of @Andie2302's answer:
-\b(?:
214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
[1-9][0-9]{1,8}|[0-9]|-0
)\b|
\b(?:
214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
[1-9][0-9]{1,8}|[0-9]|-0
)\b
through its matching automaton:
Still not convinced?
HTH
Upvotes: 5