Reputation: 627
Is there a way to limit a regular expression to 100 characters with a regular expression?
\[size=(.*?)\](.*?)\[\/size]
So Look at me!
wouldn't work.
I want to limit the numbers, only allow numbers between 1 and 100.
Upvotes: 27
Views: 188435
Reputation: 10500
This is a sequel to my previous answer that makes use of .NET's balancing groups.
Well... also yes. I just discovered yet another beautiful (or horrible) PCRE2 feature: Non-atomic lookaround assertions.
That's a great question! Thanks for asking! (Although even if you didn't, I'll just pretend that you did.)
Non-atomic lookarounds are essentially lookarounds, but they allow backtracking, unlike normal lookarounds which are atomic by default.
In a dialect of regex math that I just invented, the relationships between them can be represented by the following equations:
(?*) = (?=) - (?>)
(?<*) = (?<=) - (?>)
Of course! Here's a simple task:
Given the following string, match every (non-overlapping) substring that:
- Starts and ends with @,
- Is as long as possible, yet no longer than 18 characters, and
- Has at most 10 hyphens (-) between those two @.
@@--@---@@-@-@----@--@-@---@-@@-@-@-@@-@-@-@----@-@-@-@-@-@-@-@--@-@-@-@---@-@@-@--@--@--@----@
The expected result would be:
[@@--@---@@-@-@]----[@--@-@---@-@@-@-@]-[@@-@-@-@----@-@-@]-[@-@-@-@-@--@-@-@-@]---[@-@@-@--@--@--@]----@
If we were using .NET, we could easily work out something like this from what we learnt in the previous answer:
(?<w>){16}
(?<h>){10}
@(?<-w>@|(?<-h>-))+@
Try it on regex101.com.
First, we want something that starts with @ and ends with @. If we only had to verify whether a full string meets our requirements, this answer has told us that we can put a lookahead before the actual expression to check its total length. The following matches @@--@---@@-@-@ but not @--@-@---@-@@-@-@-@@:
(?=.{2,18}$)
^@(?:@*-){0,10}@*@$
Try it on regex101.com.
That said, we can also swap the two expressions:
(?=@(?:@*-){0,10}@*@$)
^.{2,18}$
Try it on regex101.com.
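As a quick check outside regex101, here is a minimal Python sketch of the two equivalent forms above. It assumes Python's standard re module, whose lookaheads behave the same way for this purpose; the names LENGTH_FIRST, SHAPE_FIRST and check are made up for illustration.

```python
import re

# Length in the lookahead, shape in the main expression.
LENGTH_FIRST = re.compile(r"(?=.{2,18}$)^@(?:@*-){0,10}@*@$")
# Swapped: shape in the lookahead, length in the main expression.
SHAPE_FIRST = re.compile(r"(?=@(?:@*-){0,10}@*@$)^.{2,18}$")


def check(text: str) -> tuple[bool, bool]:
    """Return (accepted by LENGTH_FIRST, accepted by SHAPE_FIRST)."""
    return (
        LENGTH_FIRST.search(text) is not None,
        SHAPE_FIRST.search(text) is not None,
    )


if __name__ == "__main__":
    for sample in ("@@--@---@@-@-@", "@--@-@---@-@@-@-@-@@"):
        print(sample, check(sample))
```

Both forms should agree on every input, since they test the same two conditions against the same stretch of text.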
We were using $ as an anchor to apply both expressions to the same piece of text. However, we can't use it in our original use case. Instead, we need to create our own anchor by adding a backreference:
(?= # Keep the current position in mind, skip forward to
@(?:@*-){0,10}@*@ # the rightmost '@' that is at most 10 hyphens away
# from the current one (the one following this)
(.*) # and capture anything following it (until the end).
) # Back to current,
.{2,18} # then match the following 2 to 18 chars iff
(?=\1$) # they are themselves followed by what we captured in group 1.
Try it on regex101.com.
This matches:
[@@--@---@@-@-@]----@--@-@---[@-@@-@-@-@@-@-@-@]----@-@-@-[@-@-@-@-@--@-@-@-@]---[@-@@-@--@--@--@]----@
Finally some results! However, the matches are not the same as what we expected. The engine skipped three @ before finding the second match. Why?
The first match is correct, so let's proceed to the @ nearest to it:
...----@--@-@---@-@@-@-@-@@-@-@-@---...
       ^
The lookahead then advances to:
       1  2 3   4 56 7 8 9
...----@--@-@---@-@@-@-@-@@-@-@-@---...
                          ^<group 1...>
It did its job well: the expression inside it matched, and everything following the 10th @ is captured into group 1. However, it's the second expression (.{2,18}) that didn't match this 20-character substring:
       1  2 3   4 56 7 8 9
...----@--@-@---@-@@-@-@-@@-@-@-@---...
       [     20 chars     ]<group 1...>
This caused the whole match to fail, and the engine continued to the 2nd @ instead of returning to the lookahead to backtrack to the 9th @. We don't want that. Luckily, this is precisely what a non-atomic lookahead is for:
(?*
@(?:@*-){0,10}@*@
(.*)
)
.{2,18}(?=\1$)
Try it on regex101.com.
What's the difference between this one and the previous? Very subtle: (?*) instead of (?=). This time, the engine backtracks to the 9th @, and then the 8th, where it finally finds a match. The rest of the matches are also found this way.
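Python's standard re module has no (?*...) construct, but its ordinary lookahead is atomic as well, so it can at least reproduce the skipping behaviour described earlier. The following is a small sketch under that assumption; TEXT is the sample string from this answer, and ATOMIC and atomic_matches are names chosen here for illustration.

```python
import re

TEXT = "@@--@---@@-@-@----@--@-@---@-@@-@-@-@@-@-@-@----@-@-@-@-@-@-@-@--@-@-@-@---@-@@-@--@--@--@----@"

# Atomic lookahead plus backreference anchor, as in the (?=...) version.
ATOMIC = re.compile(r"(?=@(?:@*-){0,10}@*@(.*)).{2,18}(?=\1$)")


def atomic_matches(text: str) -> list[str]:
    """Return every non-overlapping match of the atomic-lookahead form."""
    return [m.group(0) for m in ATOMIC.finditer(text)]


if __name__ == "__main__":
    found = atomic_matches(TEXT)
    # The second match starts three '@' later than the expected one.
    print(found[:2])  # ['@@--@---@@-@-@', '@-@@-@-@-@@-@-@-@']
```

Every match it does return still satisfies the three requirements; what the atomic form loses is the guarantee that each match starts at the leftmost possible @.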
It follows this pattern:
$/\z or similar.
Do note that you need to be careful when choosing the first expression. For example, the following matches in ~98k steps against the problem I gave in my last answer:
(?*\[size=\d+].*?\[/size](.*))
(?=.{24,40}\1$)
.{0,39}t.*?(?=\1$)
Try it on regex101.com.
...whereas this one, although it looks very similar, can't even reach the same conclusion:
(?*.{24,40}(.*))
(?=\[size=\d+].*?\[/size]\1$)
.{0,39}t.*?(?=\1$)
Try it on regex101.com.
(Due to the nature of the text, (?*) can be replaced with (?=) for much better performance: ~3k steps.)
A much stronger no. This trick is even worse than the .NET one since you will also need to test your regex extensively to make sure that you chose a good first expression.
Knowing about these tricks is a good thing, using them in production code is not.
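In the spirit of that advice, the task from this answer can also be solved with a plain loop. The following is a minimal Python sketch, not part of the original answer; longest_bounded and its parameters are names invented here, and the greedy left-to-right scan is an assumption about what "as long as possible" and "non-overlapping" should mean.

```python
def longest_bounded(text: str, max_len: int = 18, max_hyphens: int = 10) -> list[str]:
    """Scan left to right, taking the longest valid substring at each '@'."""
    found = []
    i = 0
    while i < len(text):
        if text[i] != "@":
            i += 1
            continue
        # Try the longest candidate first and shrink until one fits.
        for end in range(min(i + max_len, len(text)), i + 1, -1):
            candidate = text[i:end]
            if candidate[-1] == "@" and candidate.count("-") <= max_hyphens:
                found.append(candidate)
                i = end  # continue after this match (non-overlapping)
                break
        else:
            i += 1  # no valid substring starts at this '@'
    return found


if __name__ == "__main__":
    TEXT = "@@--@---@@-@-@----@--@-@---@-@@-@-@-@@-@-@-@----@-@-@-@-@-@-@-@--@-@-@-@---@-@@-@--@--@--@----@"
    for match in longest_bounded(TEXT):
        print(match)
```

On the sample string this yields the same five substrings as the expected result given at the start of this answer.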
Upvotes: 1
Reputation: 10500
If you are using PCRE2, see this answer which makes use of non-atomic lookaheads.
If you are using .NET, then... yes, it is possible with balancing groups.
For example, let's say we want to match all instances of \[size=(.*?)](.*?)\[/size] in the following paragraph (generated by ChatGPT), where the total length, including those BBCode tags, does not exceed 40. Those instances are marked in bold and italic for you to see:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed condimentum velit a felis commodo, ac efficitur nunc [size=27]Hello World[/size] dictum. Fusce auctor, [size=30]This is a test[/size] sit amet hendrerit commodo, [size=59]The quick brown fox jumps over the lazy dog[/size] enim augue consectetur nulla, vel blandit magna est vel sapien. Nam [size=42]Lorem ipsum dolor sit amet[/size] mattis ligula eu [size=41]Happiness is a warm puppy[/size] condimentum rhoncus. [size=21]Short[/size] Interdum et malesuada fames ac ante ipsum primis in faucibus. Donec interdum, [size=50]Supercalifragilisticexpialidocious[/size] ut posuere sapien venenatis vel.
We will come back to this in a while.
A capturing group ((?<group>)) in .NET is basically a counter. We decrease that counter with a balancing group ((?<-group>), note the -). This counter cannot be negative; we can't decrease it more than the number of stacks it has. For more information, see this answer.
The main idea is to count our way to the given maximum length before decreasing that count until it reaches 0. That being said, this is another way to write ^.{1,40}:
^
(?<counter>){40}
(?<-counter>.)+
+ is supposed to match as many as there are to match. However, since counter has only 40 stacks, + is prevented from matching the 41st character. You are right, this is a silly example. However, we can take a step further.
As you can see, the regex \[size=(.*?)](.*?)\[/size] has some concrete parts and some dynamic parts. Needless to say, the maximum length of the dynamic parts is equal to the maximum length overall minus the concrete length.
Let Lc, Ld and Lx be the length of the concrete parts, the length of the dynamic parts and the maximum length overall, respectively. The relationship between them can be written as the following inequalities:
Lc + Ld <= Lx
Ld <= Lx - Lc
We know what Lx is: 40. We also know what Lc is: the length of [size=][/size], or 14. This means Ld is smaller than or equal to 40 - 14 = 26. In regex terms, that goes as follows:
(?<counter>){26}            # Push 26 stacks onto <counter>
\[size=((?<-counter>.)*?)]  # pop n stacks,
((?<-counter>.)*?)          # then another m.
\[/size]                    #
Since the number of stacks cannot be negative, m + n can never exceed 26.
Do note that whatever lies in between = and ] is meaningless to regex itself. [size=99] is no different than [size=10], regardless of what follows ].
Try it on regex101.com.
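The budget arithmetic can be double-checked outside .NET. Python's re module has no balancing groups, so the following sketch applies the same bound (Ld <= Lx - Lc) procedurally instead of inside the pattern; MAX_TOTAL, CONCRETE, BUDGET and within_budget are illustrative names, and the sample text is an excerpt of the paragraph above.

```python
import re

MAX_TOTAL = 40                    # Lx
CONCRETE = len("[size=][/size]")  # Lc
BUDGET = MAX_TOTAL - CONCRETE     # upper bound on Ld

TAG = re.compile(r"\[size=(.*?)](.*?)\[/size]")


def within_budget(text: str) -> list[str]:
    """Keep tags whose two dynamic parts together fit in BUDGET characters."""
    return [
        m.group(0)
        for m in TAG.finditer(text)
        if len(m.group(1)) + len(m.group(2)) <= BUDGET  # n + m <= 26
    ]


if __name__ == "__main__":
    sample = (
        "nunc [size=27]Hello World[/size] dictum, "
        "[size=59]The quick brown fox jumps over the lazy dog[/size] enim"
    )
    print(CONCRETE, BUDGET)
    print(within_budget(sample))
```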
Obviously, yes. It wouldn't make sense to have the minimum length smaller than that of the concrete parts, so let's just assume Ln = 24. This means Ld >= Ln - Lc = 24 - 14 = 10.
We go the same way: Count first, decrease later. As with Lx, if there are not enough stacks to be popped, the regex simply fails.
(?<max>){26} #
\[size=((?<min-max>.)*?)] # Push a stack to <min>
((?<min-max>.)*?) # whenever we pop a stack from <max>.
\[/size] #
(?<-min>){10} # Finally, pop 10 stacks from <min>.
Try it on regex101.com.
It relies on three simple things: the concrete parts, the dynamic parts, and how good you are at counting, addition and subtraction. The steps can be generically described as:
(?<-counter>) to any single-character dynamic expression, including but not limited to character classes ([]), metasequences (e.g. \d, \w, etc.) and the almighty dot (.).
If there are branches, follow those steps separately for each branch:
(?<counter>){20}
(?:
(?<-counter>){3}foo
(?<-counter>.)+ # Maximum 17 characters
|
(?<-counter>){4}baar
(?<-counter>.)+ # Maximum 10 characters
(?<-counter>){6}bazqux
)
Of course, you can also add new stacks to the counter should the need arise:
(?<counter>){20}
(?:
(?<-counter>){3}foo
(?<-counter>.)+
|
(?<counter>){10} # For some inexplicable reasons, add 10 to the limit.
(?<-counter>){6}baar # ...and, in the same spirit, 'baar' counts as -6.
(?<-counter>.)+
(?<-counter>){6}bazqux
)
Pretty powerful, isn't it? However...
No. Unless you don't have access to a programming language, you should not use this trick in your (production) code. It is better to match all instances and then filter those you don't want out based on their length. Simplicity counts.
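A minimal Python sketch of that simpler approach, matching every tag first and filtering by total length afterwards. The function name tags_with_length and its parameters are assumptions made for this example; the sample text is an excerpt of the paragraph used earlier.

```python
import re

TAG = re.compile(r"\[size=(.*?)](.*?)\[/size]")


def tags_with_length(text: str, min_len: int = 0, max_len: int = 40) -> list[str]:
    """Return every [size=...]...[/size] whose total length is in range."""
    return [
        m.group(0)
        for m in TAG.finditer(text)
        if min_len <= len(m.group(0)) <= max_len
    ]


if __name__ == "__main__":
    sample = (
        "nunc [size=27]Hello World[/size] dictum. Fusce auctor, "
        "[size=30]This is a test[/size] sit amet, "
        "[size=42]Lorem ipsum dolor sit amet[/size] mattis, "
        "[size=21]Short[/size] Interdum"
    )
    print(tags_with_length(sample))              # at most 40 characters
    print(tags_with_length(sample, min_len=24))  # between 24 and 40 characters
```

The length limits live in ordinary code here, so changing them does not require touching the pattern.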
Upvotes: 2
Reputation: 23
You could do a negative lookahead for the number of characters you want. So if you have a complex regex to get a specific format and you want to limit it to, say, 50 characters, you could preface it with:
(?!.{51})
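A small Python sketch of this idea. The "complex" format used here (lowercase words separated by single spaces) is only a stand-in invented for the example, as are the names LIMITED and fits.

```python
import re

# (?!.{51}) rejects any position that is followed by 51 or more characters.
LIMITED = re.compile(r"(?!.{51})[a-z]+(?: [a-z]+)*")


def fits(text: str) -> bool:
    """True if text has the expected format and at most 50 characters."""
    return LIMITED.fullmatch(text) is not None


if __name__ == "__main__":
    print(fits("a" * 50))
    print(fits("a" * 51))
    print(fits("hello world"))
```

Note that the lookahead only guards the position where it is written, so it is most useful when the match is anchored to the start of the input, as fullmatch does here.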
Upvotes: 1
Reputation: 129
Limit the length of characters in a regular expression:
^[a-z]{6,15}$
Limit the length of letters or digits in a regular expression:
^[a-z0-9]{6,15}$
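A quick Python check of both ideas. In this sketch the second character class is written as [a-z0-9], because a space or | inside square brackets is matched literally rather than acting as "or"; the names LETTERS and ALNUM are made up for the example.

```python
import re

LETTERS = re.compile(r"[a-z]{6,15}")   # 6 to 15 lowercase letters
ALNUM = re.compile(r"[a-z0-9]{6,15}")  # 6 to 15 lowercase letters or digits


def is_letters(text: str) -> bool:
    return LETTERS.fullmatch(text) is not None


def is_alnum(text: str) -> bool:
    return ALNUM.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abcdef", "abcde", "abc123", "a" * 16):
        print(sample, is_letters(sample), is_alnum(sample))
```

fullmatch plays the role of the ^ and $ anchors here.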
Upvotes: 4
Reputation: 441
If you want to restrict valid input to integer values between 1 and 100, this will do it:
^([1-9]|[1-9][0-9]|100)$
Explanation:
This will not accept:
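The expression above can be exercised from Python as follows. This is only a sketch; ONE_TO_HUNDRED and in_range are illustrative names, and fullmatch is used so that a trailing newline is not silently accepted by $.

```python
import re

ONE_TO_HUNDRED = re.compile(r"^([1-9]|[1-9][0-9]|100)$")


def in_range(text: str) -> bool:
    """True if text is an integer from 1 to 100 without leading zeros."""
    return ONE_TO_HUNDRED.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("1", "42", "100", "0", "101", "007", "1.5", ""):
        print(repr(sample), in_range(sample))
```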
Upvotes: 19
Reputation: 11
(^(\d{2})|^(\d{4})|^(\d{5}))$
This expression matches numbers of length 2, 4 or 5. Valid inputs are 12, 1234 and 12345.
Upvotes: 0
Reputation: 311526
Is there a way to limit a regex to 100 characters WITH regex?
Your example suggests that you'd like to grab a number from inside the regex and then use this number to place a maximum length on another part that is matched later in the regex. This usually isn't possible in a single pass. Your best bet is to have two separate regular expressions:
If you just want to limit the number of characters matched by an expression, most regex flavors support bounded quantifiers written with braces. For instance,
\d{3}-\d{3}-\d{4}
will match (US) phone numbers: exactly three digits, then a hyphen, then exactly three digits, then another hyphen, then exactly four digits.
Likewise, you can set upper or lower limits:
\d{5,10}
means "at least 5, but not more than 10 digits".
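Both bounded quantifiers can be tried in Python. This is a short sketch with names (PHONE, FIVE_TO_TEN) chosen for the example.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")  # exactly 3, 3 and 4 digits
FIVE_TO_TEN = re.compile(r"\d{5,10}")     # at least 5, at most 10 digits


def is_phone(text: str) -> bool:
    return PHONE.fullmatch(text) is not None


def is_five_to_ten_digits(text: str) -> bool:
    return FIVE_TO_TEN.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_phone("555-123-4567"))
    print(is_five_to_ten_digits("1234"))
    print(is_five_to_ten_digits("12345"))
    print(is_five_to_ten_digits("12345678901"))
```

The upper bound only limits a single match: an unanchored search over an 11-digit string would still find its first 10 digits, which is why the sketch uses fullmatch.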
Update: The OP clarified that he's trying to limit the value, not the length. My new answer is don't use regular expressions for that. Extract the value, then compare it against the maximum you extracted from the size parameter. It's much less error-prone.
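A minimal Python sketch of the "extract, then compare" approach. The limits 1 and 100 come from the question; the pattern restricting the size to digits, the function name valid_size_tags and the sample values are assumptions made for illustration.

```python
import re

# Capture the numeric size and the enclosed text separately.
SIZE_TAG = re.compile(r"\[size=(\d+)](.*?)\[/size]")


def valid_size_tags(text: str, low: int = 1, high: int = 100) -> list[str]:
    """Return the tags whose size value lies between low and high."""
    return [
        m.group(0)
        for m in SIZE_TAG.finditer(text)
        if low <= int(m.group(1)) <= high  # compare the value, not its length
    ]


if __name__ == "__main__":
    # Hypothetical input; the size values are made up for this example.
    sample = "[size=20]ok[/size] and [size=500]Look at me![/size]"
    print(valid_size_tags(sample))
```

Doing the range check on an integer keeps the pattern short and makes the limits easy to change.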
Upvotes: 40