Reputation: 59

is there a regex to generate all integers for a certain programming language

let's say I'm building a compiler and I want the lexical analyzer to recognize integers of the C language, can I specify for example that the integer should be between –2,147,483,648 and 2,147,483,647 that a long integer can be 64 bits? I feel that my question is stupid but I want to know if it's doable... thanks

Upvotes: 2

Answers (2)

Andie2302

Reputation: 4897

This regex should work:

-\b(?:214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b|\b(?:214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b

Upvotes: 1

zmo

Reputation: 24802

Short answer

Yes that can be done, but you should not do that!

Spoiler alert: you should better be using strtol, and I'm telling you why in the long answer.

Long answer

it can be done using a weirdly crafted regexp (the worst one being a regexp with the list of all integers between MIN and MAX), but you do not want to do such a thing.

This is because such a task would mean a massive processing for the regexp, whereas that test can be done in a very little processing in your favorite language (consider the following as pseudocode):

if (str_to_int(s) > CMIN && str_to_int(s) < CMAX)

Well, actually you might tell me "but if it's an int, it will overflow!". But there are technics to detect that:

How to detect integer overflow?

and none of them is using a regex!

But anyway, you don't need to go into so much trouble, when there's a function already baked in the C standard library that does that job for you: the strtol function! Quoting the manual:

The strtol() function returns the result of the conversion, unless the value would underflow or overflow. If an underflow occurs, strtol() returns LONG_MIN. If an overflow occurs, strtol() returns LONG_MAX. In both cases, errno is set to ERANGE. Precisely the same holds for strtoll() (with LLONG_MIN and LLONG_MAX instead of LONG_MIN and LONG_MAX).

Why would it be massive? It's because a regexp is an automaton looking at a stream of characters. When there's a match, you move along the automaton. Basically, you'd need to:

match any string of 10 characters, or 11 only if it starts with a -
that contains only digits,
that if it starts with a 2, can only be followed by 0 or 1,
that if it starts with a 2, followed by a 1, can only be followed by 0, 1, 2, 3 or 4
that if it starts with a 2, followed by a 1 and then a 4, can only be followed by a 1, 2, 3, 4 … 7
…
if it starts with a 2, folowed by … and ends with a 7, but if it started with a -, and then a 2, it needs to end with a 6 (so basically you have to copy all the previous conditions into another subgraph that ends with that)
and for any other character it's a match.

That would look a bit like the following:

^(
  (
   \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
   \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
   [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-8]
  )|
  -(
    \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
    \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
    [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-7]
   )
 )$

which is represented visually by the following automaton (click on the image to play with it):

I'm not sure how correct that would be, because I might have missed edge cases, but I hope I made it clear how it compares with doing it in your favorite language. If you actually parse such a huge automaton, it will:

burn CPU time,
burning electricity,
burning (fuel|coal|gaz|uranium),
polluting the planet,
killing a baby seal

all that instead of doing something that can be done in an operation being 1/100th of the complexity of doing the same thing using a regexp.

cute baby seal

So if you don't want to kill a baby seal because of bad programming, don't use a regexp for something it hasn't been designed for.

Resources

To better understand what that is an automaton, how regexps are working, when is it a good idea to use and when it's a baby seal killing one, I can only advice you to look at the following courses:

Here's the visualization of @Andie2302's answer:

-\b(?:
     214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b|
 \b(?:
     214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b

through its matching automaton:

Still not convinced?

HTH

Upvotes: 5

is there a regex to generate all integers for a certain programming language

Answers (2)

Short answer

Long answer

Resources

Related Questions