user1875921

Reputation: 151

Regex - Accept A-Z excluding some letters for a fixed length

I can't find how to exclude some chars from a fixed-length part of a string.

With ^XXX-\d{4}-(?![XYT])[A-Z]{4}$ I can exclude X, Y and T from the first char of the last segment, so

XXX-0000-AAAA is ok
XXX-0000-XAAA is not ok

My problem is that I do not want X, Y or T in any part of the last segment:

XXX-0000-AAXA is not ok
XXX-0000-ABXX is not ok
XXX-0000-ABCT is not ok
and so on

How can I do that?

To be more precise: X, Y and T are variables, so a fixed list of allowed letters works but is not convenient.

Upvotes: 3

Views: 2757

Answers (5)

mickmackusa

Reputation: 47934

Okay, this seems to be the most efficient correct pattern I can make (Demo):

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

I have set up a battery of strings to match against, which should expose (and does expose) any flaws in the patterns posted on this page. My pattern completes the test in 176 steps and provides correct matching. Of the correct patterns listed here that use a negative lookahead, as requested by the OP, that makes it the most efficient.
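Since the OP mentions that the excluded letters are variable, here is a minimal Python sketch of building this pattern from a string of excluded letters. The names `build_pattern` and `excluded` are illustrative assumptions, not part of the original answer, and the sketch assumes `excluded` is non-empty (an empty string would produce an invalid character class).

```python
import re

def build_pattern(excluded: str) -> re.Pattern:
    """Compile the single-lookahead pattern for a set of excluded letters.

    `build_pattern` and `excluded` are hypothetical names for illustration.
    """
    # Escape each letter so it is taken literally inside the character class.
    cls = "".join(re.escape(ch) for ch in excluded)
    # (?![A-Z]{0,3}[...]) rejects the segment if any of its four
    # positions holds an excluded letter.
    return re.compile(rf"^X{{3}}-\d{{4}}-(?![A-Z]{{0,3}}[{cls}])[A-Z]{{4}}$")

pattern = build_pattern("XYT")
for s in ["XXX-0000-AAAA", "XXX-0000-XAAA", "XXX-0000-AAXA", "XXX-0000-ABCT"]:
    print(s, bool(pattern.match(s)))
```

With "XYT" as input, the compiled pattern is the same as the one shown above; only the first of the four sample strings should be accepted.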

For apples to apples comparison:

original 190 steps ^XXX-\d{4}-(?![XYT])[A-Z]{4}$ user1875921 Demo

incorrect 300 steps XXX-\d{4}-(?!.*?[XYT])[A-Z]{4} Sahil Gulati Demo

correct 140 steps ^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$ [commented] Demo

correct 245 steps ^XXX-\d{4}-((?![XYT])[A-Z]){4}$ Bohemian #1 Demo

correct 279 steps ^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$ melpomene Demo

n/a - ^XXX-\d{4}-[A-Z&&[^XYT]]{4}$ Bohemian #2
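A rough way to reproduce the correct/incorrect labels (not the step counts, which come from the regex101 debugger) is to run each pattern against a few strings in Python. The test strings and labels below are illustrative assumptions, not the battery from the linked demos, and the `[A-Z&&[^XYT]]` variant is left out because Python's `re` module does not support that syntax.

```python
import re

# Patterns from the comparison above; keys are informal labels.
patterns = {
    "original": r"^XXX-\d{4}-(?![XYT])[A-Z]{4}$",
    "unanchored lookahead": r"XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}",
    "explicit class": r"^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$",
    "per-char lookahead": r"^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$",
    "single lookahead": r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$",
}

# Hypothetical test strings, mapped to whether the whole string
# should be accepted.
cases = {
    "XXX-0000-AAAA": True,
    "XXX-0000-XAAA": False,
    "XXX-0000-AAXA": False,
    "XXX-0000-ABCT": False,
    "XXX-0000-ABCDE": False,  # five letters: too long
}

def failures(regex: str) -> list:
    """Return the test strings on which `regex` gives the wrong answer."""
    compiled = re.compile(regex)
    return [s for s, want in cases.items() if bool(compiled.search(s)) != want]

for label, regex in patterns.items():
    print(label, failures(regex))
```

On these strings the original pattern wrongly accepts AAXA and ABCT, and the unanchored pattern wrongly accepts the five-letter segment because nothing forces the match to end after four letters. That is one way the unanchored pattern can go wrong; the linked demos may expose others.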

Upvotes: 1

hmedia1

Reputation: 6180

Here is a versatile and fast alternative.

TL;DR: ^XXX-[0-9]{4}-[^XYT -@[-²]{4}$

The question on this thread highlights a challenge when using regular expressions in a fashion that almost requires "boolean" ways to represent character classes, such as ['A-Z' but not 'XYZ']. For this reason, this answer is presented (as an edit and an update) for the benefit of others facing similar scenarios as the one the OP described.

Given the lack of direct support for a syntax such as ['A-Z' but not 'XYZ'], the usual way to achieve this type of logic and control the order of precedence for overlapping expressions in a regex, short of listing the allowed characters explicitly, is to use features such as lookaround assertions.

However, applying them inefficiently can be extremely costly, as pointed out in one of the other answers here.

Here are some reasons why differing performance criteria between applications make it hard to have one generic regex that achieves this:

  • The performance of single string matching, where speed differences are unnoticeable, may call for a more reliable or robust regex. This may be the case for portability in code (i.e. almost every regex parser knows what [[-\`a-~!-@] means, but some don't know what \W or [:punct:] mean, for instance).
  • At the other end of the scale, where nanoseconds are critical, one may wish to re-evaluate many other parts of the process before getting caught up with the regex. In any case, a very inflexible but high-performance regex might be preferred here, where the system can be made compatible if it isn't already.
  • The variety of strings has a major impact on performance, and so depending on the application, certain parts of a string can be assessed differently
  • For the same reason, the decision of how to structure the Lookaround should probably be determined based on the use case.
  • Where strings are part of a database and searching is done via an API or other built-in function, specific syntax or format may need to be used.
  • Aside from the matching expression, different regex libraries, functions and extensions can change the entire way a regex is applied through their options. For example, Python's re.findall() can be used as the operational equivalent of a positive lookbehind with unknown repetition length.
  • Some programs that implement regex are just much faster than others. This can more than offset the efficiency difference when comparing theoretical steps.

Here is a balanced approach to the question:

^XXX-[0-9]{4}-[^XYT -@[-²]{4}$

Here is an example where 10000 strings are matched of 10100: https://regex101.com/r/YJ5xME/1

Here is an example where 100 strings are matched of 10100: https://regex101.com/r/d1l5af/1

There's not a great difference in performance between the two at 10000 strings. By contrast, this regex: ^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$ takes over twice the time to match the 10000 strings.

This is also compatible with applying an exclude variable, as requested.

An example on the command line using bash:

Take a string file stringfile, with contents such as:

  • XXX-0000-SNUR
    XXX-0000-FHDZ
    XXX-0000-+439
    XXX-0000-04X9
    XXX-0000-/1Y+
    XXX-0000-X/X9
    XXX-0000-Y6X9
    XXX-0000-XY16
    XXX-0000-0T94
    XXX-0000-++6Y
    XXX-0000-TT+3
    XXX-0000-NLNL
    XXX-0000-QPSE
    

Use of a variable $exclude for example:

  • exclude="XYT"
    egrep "^XXX-[0-9]{4}-([^$exclude -@[-²]){4}$" < stringfile

Correct matches:

  • XXX-0000-SNUR
    XXX-0000-FHDZ
    XXX-0000-NLNL
    XXX-0000-QPSE

This is also compatible with extended regular expressions:

  • Use cases like GNU find with -iregex (if dealing with filenames)
  • egrep (grep -E)
  • Anything with support for Modern Regular Expressions as defined in POSIX 1003.2

Can I just build the right expression before going on to match thousands of lines?

And here's where efficiency meets accuracy. Example in the command line:

exclude="XYT"
customCharClass="$(
alpha=ABCDEFGHIJKLMNOPQRSTUVWXYZ
echo "${alpha[@]}" \
| sed -E -e "s/[$exclude]//g")"

egrep "^XXX-[0-9]{4}-([$customCharClass]){4}$" < stringfile

Now this is the regex applied to the stringfile:

^XXX-[0-9]{4}-([ABCDEFGHIJKLMNOPQRSUVWZ]){4}$ 

(Notice there is no X Y or T)

  • The example given at the top offers easy portability and quick performance, and covers the common use case where extended character sets aren't involved.
  • The example here, where a program decides the most effective search criteria, is guaranteed to account for all scenarios.
  • The tradeoff, and the criterion for evaluating performance, is the use case: generating a custom search string will almost certainly take longer for just one string, but that cost is negligible when searching thousands of files.
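The same "build the class first, then match" idea can be sketched in Python. This is only an illustration: `allowed_class` is a hypothetical helper name, the sample lines are a subset of the stringfile above, and the sketch assumes at least one letter remains allowed (an empty class would not compile).

```python
import re
import string

def allowed_class(excluded: str) -> str:
    """Return the uppercase letters that remain after removing `excluded`."""
    return "".join(ch for ch in string.ascii_uppercase if ch not in excluded)

exclude = "XYT"
custom_class = allowed_class(exclude)
# No lookaround is needed because the class lists only permitted letters.
pattern = re.compile(rf"^XXX-[0-9]{{4}}-[{custom_class}]{{4}}$")

lines = ["XXX-0000-SNUR", "XXX-0000-04X9", "XXX-0000-XY16", "XXX-0000-QPSE"]
matches = [line for line in lines if pattern.match(line)]
print(custom_class)
print(matches)
```

The class is computed once, so its cost is paid a single time regardless of how many lines are matched afterwards.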

There's another answer on this thread that complements this answer.

This regex was posted by @mickmackusa

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

This regex performs well and is cleaner than the alternative presented in this answer, although it does require an engine with lookahead support, such as PCRE.

It is slightly slower (though by no means inefficient or wasteful), but it is guaranteed to match only [A-Z] letters, with X, Y and T excluded.

This highlights the need to evaluate the performance criteria specific to the application when designing a regex that may require lookarounds.

Upvotes: 2

Sahil Gulati

Reputation: 15141

Regex: XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}

1. XXX-\d{4} this will match XXX- and then four digits

2. (?!.*?[XYT]) negative look ahead for X Y and T

3. [A-Z]{4} matches 4 characters which can include A-Z.

Regex code demo

Upvotes: 3

Bohemian

Reputation: 425083

There are two "elegant" ways of doing this. The simplest to understand is:

^XXX-\d{4}-((?![XYT])[A-Z]){4}$

This is very close to what you had, but instead applies the negative look-ahead to every character in the repetition.

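The other way is to use a character class subtraction (written as an intersection with a negated class, a syntax supported by Java's regex engine, for example):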

^XXX-\d{4}-[A-Z&&[^XYT]]{4}$

You rarely see this syntax used, so it might be good to use if nothing else to impress your colleagues.

Upvotes: 4

melpomene

Reputation: 85787

Why not just use [A-SU-WZ]{4} for the last part? I.e. only match the letters you want in the first place.

Alternatively, make the look-ahead part of the repetition: (?:(?![XYT])[A-Z]){4}
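If the excluded letters vary, a compact class like the one above can be generated rather than written by hand. The following Python sketch is one possible approach; `compact_class` is a hypothetical helper, and it assumes only uppercase ASCII letters are involved.

```python
import string

def compact_class(excluded: str) -> str:
    """Collapse the allowed uppercase letters into ranges such as A-S."""
    allowed = [ch for ch in string.ascii_uppercase if ch not in excluded]
    parts = []
    i = 0
    while i < len(allowed):
        j = i
        # Extend the run while the next allowed letter is consecutive.
        while j + 1 < len(allowed) and ord(allowed[j + 1]) == ord(allowed[j]) + 1:
            j += 1
        if j - i >= 2:
            # Three or more consecutive letters: write them as a range.
            parts.append(f"{allowed[i]}-{allowed[j]}")
        else:
            # One or two letters: list them individually.
            parts.append("".join(allowed[i:j + 1]))
        i = j + 1
    return "".join(parts)

print(compact_class("XYT"))
```

The returned string is meant to be placed between square brackets, e.g. `"[" + compact_class("XYT") + "]{4}"`.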

Upvotes: 5
