Reputation: 151
I can't find how to exclude some characters from a fixed-length part of a string.
With ^XXX-\d{4}-(?![XYT])[A-Z]{4}$ I can exclude X, Y, and T from the first character of the last segment, so:
XXX-0000-AAAA is ok
XXX-0000-XAAA is not ok
My problem is that I do not want X, Y, or T in any part of the last segment:
XXX-0000-AAXA is not ok
XXX-0000-ABXX is not ok
XXX-0000-ABCT is not ok
and so on
How can I do that?
To be more precise, the excluded characters (X, Y, and T here) are variable, so a solution based on a fixed list works but is not convenient.
Upvotes: 3
Views: 2757
Reputation: 47934
Okay, this seems to be the most efficient correct pattern I can make:
^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$
I have set up a battery of strings to match against which should/does expose any flaws in the patterns posted on this page. My pattern completes the test in 176 steps and provides correct matching. This makes it the best pattern that uses a negative lookahead as requested by the OP.
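As a rough check of the lookahead pattern above, here is a minimal Python sketch that runs it against the sample strings from the question. The use of Python's re module and the names PATTERN and is_valid are assumptions made for illustration; the step counts quoted in this answer come from the regex demo, not from this code.

```python
import re

# Lookahead pattern from this answer: fail if X, Y, or T appears
# anywhere within the four letters of the last segment.
PATTERN = re.compile(r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$")


def is_valid(code: str) -> bool:
    """Return True when the string matches the anchored pattern."""
    return PATTERN.match(code) is not None


if __name__ == "__main__":
    # Sample strings taken from the question.
    samples = [
        "XXX-0000-AAAA",
        "XXX-0000-XAAA",
        "XXX-0000-AAXA",
        "XXX-0000-ABXX",
        "XXX-0000-ABCT",
    ]
    for sample in samples:
        print(sample, is_valid(sample))
```

Only the first sample should be accepted; the other four contain X or T somewhere in the last segment.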
For an apples-to-apples comparison:
original (user1875921): 190 steps
^XXX-\d{4}-(?![XYT])[A-Z]{4}$
Sahil Gulati: incorrect, 300 steps
XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}
[commented]: correct, 140 steps
^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$
Bohemian #1: correct, 245 steps
^XXX-\d{4}-((?![XYT])[A-Z]){4}$
melpomene: correct, 279 steps
^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$
Bohemian #2: n/a
^XXX-\d{4}-[A-Z&&[^XYT]]{4}$
Upvotes: 1
Reputation: 6180
TL;DR: ^XXX-[0-9]{4}-[^XYT -@[-²]{4}$
The question on this thread highlights a challenge when using regular expressions in a fashion that almost requires "boolean" ways to represent character classes, such as ['A-Z' but not 'XYZ']. For this reason, this answer is presented (as an edit and an update) for the benefit of others facing similar scenarios as the one the OP described.
Given the lack of direct support for a syntax such as ['A-Z' but not 'XYZ'], the only way to achieve this type of logic and control the order of precedence for overlapping expressions in a regex is to use features such as lookaround assertions.
However, applying them inefficiently can be extremely costly, as pointed out in one of the other answers here.
Here are some ways where the drastic difference in performance criteria for an application makes it impossible to have a generic regex that achieves this:
[[-\`a-~!-@] means, but some don't know what \W or [:punct:] mean, for instance.
re.findall() can be used as the operational equivalent of a positive lookbehind with unknown repetition length.
^XXX-[0-9]{4}-[^XYT -@[-²]{4}$
Here is an example where 10000 strings are matched of 10100: https://regex101.com/r/YJ5xME/1
Here is an example where 100 strings are matched of 10100: https://regex101.com/r/d1l5af/1
There's not a great difference in performance between the two at 10000 strings. By contrast, this regex takes over twice the time to match the 10000 strings:
^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$
An example on the command line using bash:
Take a string file, stringfile, with contents such as:
XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-+439
XXX-0000-04X9
XXX-0000-/1Y+
XXX-0000-X/X9
XXX-0000-Y6X9
XXX-0000-XY16
XXX-0000-0T94
XXX-0000-++6Y
XXX-0000-TT+3
XXX-0000-NLNL
XXX-0000-QPSE
Use of a variable $exclude, for example:
exclude="XYT"
egrep "^XXX-[0-9]{4}-([^$exclude[^XYT -@[-²]){4}$" < stringfile
Correct matches:
XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-NLNL
XXX-0000-QPSE
find with -iregex (if dealing with filenames)
egrep (grep -E)
And here's where efficiency meets accuracy. Example in the command line:
exclude="XYT"
customCharClass="$(
alpha=ABCDEFGHIJKLMNOPQRSTUVWXYZ
echo "${alpha[@]}" \
| sed -E -e "s/[$exclude]//g")"
egrep "^XXX-[0-9]{4}-([$customCharClass]){4}$" < stringfile
Now this is the regex applied to the stringfile:
^XXX-[0-9]{4}-([ABCDEFGHIJKLMNOPQRSUVWZ]){4}$
(Notice there is no X Y or T)
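For readers not working in a shell, the same idea (derive the allowed character class from a variable list of excluded letters) can be sketched in Python. This is only an illustration under assumptions: the function name build_pattern is hypothetical, the excluded string is assumed to contain only uppercase letters, and at least one letter is assumed to remain allowed.

```python
import re
import string


def build_pattern(exclude: str) -> re.Pattern:
    """Build the regex with an explicit class of allowed uppercase letters.

    Assumes `exclude` leaves at least one letter of A-Z allowed.
    """
    # Keep every uppercase ASCII letter that is not in the exclusion list.
    allowed = "".join(c for c in string.ascii_uppercase if c not in exclude)
    # Doubled braces produce literal {4} quantifiers inside the f-string.
    return re.compile(rf"^XXX-[0-9]{{4}}-[{allowed}]{{4}}$")


pattern = build_pattern("XYT")
print(pattern.pattern)
print(bool(pattern.match("XXX-0000-SNUR")))
print(bool(pattern.match("XXX-0000-04X9")))
```

Because the class lists only the letters that are allowed, no lookaround is needed, which is the trade-off this answer is describing.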
This regex was posted by @mickmackusa:
^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$
It performs well and is cleaner than the alternative presented in this answer (it does require PCRE, however).
It is slightly slower (but by no means inefficient or wasteful), and it is guaranteed to produce only an [A-Z] match (with X, Y, and T excluded).
This highlights the need to evaluate the performance criteria specific to the application when designing a regex that may require lookarounds.
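The difference in accuracy between the negated class and the lookahead pattern can be seen with a small Python sketch. Python's re module is used here as an assumption (the answer itself uses grep and PCRE), and the third sample string is made up for illustration: it ends in a letter outside A-Z that also falls outside the excluded ranges of the negated class.

```python
import re

# Negated-class pattern from the TL;DR of this answer.
NEGATED = re.compile(r"^XXX-[0-9]{4}-[^XYT -@[-²]{4}$")
# Lookahead pattern posted by @mickmackusa.
LOOKAHEAD = re.compile(r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$")

samples = [
    "XXX-0000-SNUR",        # plain allowed letters
    "XXX-0000-04X9",        # digits and an excluded letter
    "XXX-0000-ABC\u00c9",   # hypothetical input ending in an accented capital E
]

for s in samples:
    print(s, bool(NEGATED.match(s)), bool(LOOKAHEAD.match(s)))
```

Both patterns accept the first sample and reject the second, but only the negated class accepts the third, since that character lies above ² and so is not excluded.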
Upvotes: 2
Reputation: 15141
Regex: XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}
1. XXX-\d{4} matches XXX- and then four digits.
2. (?!.*?[XYT]) is a negative lookahead for X, Y, and T.
3. [A-Z]{4} matches 4 characters in the range A-Z.
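One way to see what the three parts do together is a short Python sketch. Python's re module and the extra sample strings are assumptions for illustration. Note that the lookahead scans everything after the second hyphen, not only the next four letters, and that the pattern has no ^ or $ anchors.

```python
import re

# Pattern from this answer, unanchored as posted.
UNANCHORED = re.compile(r"XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}")

# X inside the last segment: the lookahead finds it, so no match.
print(UNANCHORED.search("XXX-0000-AAXA"))

# The segment ABCD is fine, but T and Y later in the string
# also trigger the lookahead, so there is no match either.
print(UNANCHORED.search("XXX-0000-ABCD (TYPE)"))

# Without a trailing anchor, a five-letter segment still yields a match
# on its first four letters.
print(UNANCHORED.search("XXX-0000-ABCDE").group())
```

If only whole-string matches are wanted, adding ^ and $ around the pattern would address the last case.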
Upvotes: 3
Reputation: 425083
There are two "elegant" ways of doing this. The simplest to understand is:
^XXX-\d{4}-((?![XYT])[A-Z]){4}$
This is very close to what you had, but instead applies the negative look-ahead to every character in the repetition.
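The other way is to use a character class subtraction (written as an intersection with a negated class, a syntax supported by some engines such as Java's, but not by all):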
^XXX-\d{4}-[A-Z&&[^XYT]]{4}$
You rarely see this syntax used, so it might be good to use if nothing else to impress your colleagues.
Upvotes: 4
Reputation: 85787
Why not just use [A-SU-WZ]{4} for the last part? I.e. only match the letters you want in the first place.
Alternatively, make the look-ahead part of the repetition: (?:(?![XYT])[A-Z]){4}
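Since the question says the excluded letters are variable, the repeated look-ahead form is easy to generate from a string. Below is a minimal Python sketch of that idea; the function name segment_pattern is hypothetical, Python's re module is an assumption, and the excluded string is assumed to be non-empty.

```python
import re


def segment_pattern(exclude: str) -> re.Pattern:
    """Build the per-character look-ahead regex for a non-empty exclusion string."""
    # re.escape keeps characters such as ] or ^ literal inside the class.
    excluded = re.escape(exclude)
    # Doubled braces produce literal {4} quantifiers inside the f-string.
    return re.compile(rf"^XXX-\d{{4}}-(?:(?![{excluded}])[A-Z]){{4}}$")


pattern = segment_pattern("XYT")
for sample in ["XXX-0000-AAAA", "XXX-0000-AAXA", "XXX-0000-ABCT"]:
    print(sample, bool(pattern.match(sample)))
```

Changing the exclusion list then only requires passing a different string, for example segment_pattern("AB").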
Upvotes: 5