heiiRa

Reputation: 103

RegEx prioritize the longest pattern

I have some strings that I want to search for matches using regular expressions.

foo
AB0001
AB0002 foo
foo AB0003
foo AB0004A AB0004.1
AB0005.1 foo AB0005A bar AB0005

I want exactly one ID per line. IDs with a letter at the end should take priority, whereas IDs followed by .1 should be ignored.

foo                              -> no match
AB0001                           -> AB0001
AB0002 foo                       -> AB0002
foo AB0003.1                     -> no match
foo AB0004A AB0004.1             -> AB0004A
AB0005.1 foo AB0005A bar AB0005  -> AB0005A

I thought I could simply rely on the priority given by the alternation symbol (|) to prefer the ID with a capital letter at the end, but I always get multiple matches.

My attempt: regex101.com/r/yP5kX4/1

Off-topic: When should I write the whole regex, starting with ^ and ending with $, and work with capturing/non-capturing groups, and when should I write the regex as short as possible?

Upvotes: 0

Views: 621

Answers (4)

Alan Moore

Reputation: 75222

\b(AB\d{4}(?!\.\d)[A-Z]?)\b


That's AB followed by four digits, which must not be followed by a dot and a digit, but may be followed by a single letter. The word boundaries (\b) help ensure that the matched sequence is not part of a longer sequence that just happens to look like an ID.
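As a quick check, here is a minimal Python sketch that runs the pattern (without the capturing group) against sample lines from the question. The helper name first_id is only for illustration, and it assumes Python's re module treats this particular pattern the same way as the flavor used in the demo.

```python
import re

# The pattern from this answer, minus the capturing group.
ID_PATTERN = re.compile(r"\bAB\d{4}(?!\.\d)[A-Z]?\b")

def first_id(line):
    """Return the first ID found in the line, or None (illustrative helper)."""
    match = ID_PATTERN.search(line)
    return match.group(0) if match else None

samples = [
    "foo",
    "AB0001",
    "AB0002 foo",
    "foo AB0004A AB0004.1",
    "AB0005.1 foo AB0005A bar AB0005",
]

for line in samples:
    print(f"{line!r} -> {first_id(line)}")
```

The last sample line still contains a second valid ID (AB0005); search() simply stops at the first one, so findall() on that line would return both.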

An alternation-based solution is never going to work. It's true that if two or more branches of an alternation can match at a given point, the first one is always selected (in most regex flavors, anyway). But that doesn't help you, because the regex engine always favors the first (leftmost) match; that's its highest priority. So the first match wins no matter which branch of the alternation it uses.
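To illustrate the point, here is a small Python sketch. The alternation pattern is a hypothetical stand-in, not the one from the question's link, and the input strings are made up for the demonstration.

```python
import re

# Hypothetical alternation: the "letter at the end" branch comes first.
letter_first = re.compile(r"AB\d{4}[A-Z]|AB\d{4}")

# Branch order only matters when both branches can match at the SAME position.
same_position = letter_first.search("AB0004A").group(0)

# With the branches swapped, the shorter branch wins at that position.
swapped = re.search(r"AB\d{4}|AB\d{4}[A-Z]", "AB0004A").group(0)

# Here the plain ID starts further left, so it wins even though
# the preferred branch could match later in the string.
leftmost = letter_first.search("AB0005 foo AB0005A").group(0)

print(same_position, swapped, leftmost)
```

In the third case the engine never even tries the later position, because an earlier position already produced a match.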

As for the anchors (^ and $), they're usually needed only when you want to match the whole string, or a whole line in multiline mode (and BTW, since you're not using them, you don't need the /m flag; all it does is change the meaning of the anchors).
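A short Python sketch of what the multiline flag does and does not change. The sample text is a made-up excerpt of the question's input, and re.MULTILINE is assumed to correspond to the /m flag mentioned above.

```python
import re

text = "foo\nAB0001\nAB0002 foo"

# Anchored pattern: the flag decides whether ^ and $ apply per line.
anchored_plain = re.findall(r"^AB\d{4}$", text)
anchored_multi = re.findall(r"^AB\d{4}$", text, flags=re.MULTILINE)

# Unanchored pattern: the flag makes no difference at all.
free_plain = re.findall(r"AB\d{4}", text)
free_multi = re.findall(r"AB\d{4}", text, flags=re.MULTILINE)

print(anchored_plain, anchored_multi)
print(free_plain, free_multi)
```

Without the flag, the anchored pattern finds nothing because the string as a whole starts with foo; with the flag, it finds only the line that consists of nothing but an ID.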

The issue of capturing groups is interesting here because you don't need them. The only reason I used one is that the Regex101 site doesn't show the matches in the side panel unless they're in capturing groups. It's an annoying glitch in an otherwise very useful site. Generally speaking, you use capturing groups when you need to extract specific portions of the match, or when you need to use backreferences in the regex itself.
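Both uses of capturing groups can be sketched in Python as follows. The patterns and input strings here are illustrative assumptions, not part of the solution above.

```python
import re

# Extracting portions of a match: the digits and the optional letter are
# captured, while (?:AB) groups without capturing.
m = re.search(r"(?:AB)(\d{4})([A-Z]?)", "foo AB0004A")
whole, parts = m.group(0), m.groups()

# Backreference: \1 must repeat exactly what group 1 matched, so this only
# matches when the same base ID appears again later with a trailing letter.
repeat = re.search(r"\b(AB\d{4})\b.*\b\1[A-Z]\b", "AB0005 foo AB0005A")
no_repeat = re.search(r"\b(AB\d{4})\b.*\b\1[A-Z]\b", "AB0005 foo AB0006A")

print(whole, parts, repeat.group(1), no_repeat)
```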

Upvotes: 0

Quinn

Reputation: 4504

The following regex should do:

(AB(?:[0-9A-Z]{5}|[0-9]{4}))(?:\s+)

I added a non-capturing group (?:\s+) to match the whitespace after the ID, so the ID must be followed by at least one whitespace character.

My thoughts: (Please correct me if I am wrong)

When to use the whole RegEx starting with ^ ending with $? When the regex has to match from the start (^) to the end ($) of the whole string.

And work with capture/non-capturing groups? Use capturing groups if you want to extract or reference that part of the match; use non-capturing groups if you only need grouping, without extracting or referencing. Please take a look at: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?.

When should I write RegEx as short as possible? The shorter the better, as long as it works.

Upvotes: 0

user557597

Reputation:

This is one way. It's somewhat complex because the leading .*? has to be lazy in order to find the first instance of an ID.

This regex is meant to be used in multi-line mode. Add (?m) to the beginning of the regex if your flavor supports it.

The resulting ID is in capture group 1.

^.*?\b([A-Z]+\d+[A-Z]|[A-Z]+\d+(?!\.\d)(?!.*?\b[A-Z]+\d+[A-Z]))\b

Explained

 ^                                  # Beginning of line (multi-line mode)
 .*?                                # Any char, lazy to get first instance
 \b
 (                                  # (1 start), the ID
      [A-Z]+ \d+ [A-Z]                   # Priority, with trailing letter
   |                                   # or,
      [A-Z]+ \d+                         # no trailing letter
      (?! \. \d )                        # no dot digit after digit
      (?! .*? \b [A-Z]+ \d+ [A-Z] )      # and only if no trailing-letter ID downstream
 )                                  # (1 end)
 \b
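For reference, a minimal Python sketch that applies this regex in multi-line mode to the input from the question. It assumes Python's re module handles the pattern like the flavor it was written for; the names PATTERN, text, and ids are illustrative.

```python
import re

# The regex from this answer; re.MULTILINE plays the role of (?m).
PATTERN = re.compile(
    r"^.*?\b([A-Z]+\d+[A-Z]|[A-Z]+\d+(?!\.\d)(?!.*?\b[A-Z]+\d+[A-Z]))\b",
    re.MULTILINE,
)

text = "\n".join([
    "foo",
    "AB0001",
    "AB0002 foo",
    "foo AB0003",
    "foo AB0004A AB0004.1",
    "AB0005.1 foo AB0005A bar AB0005",
])

# Because of the ^ anchor, each line yields at most one match;
# the ID itself is in capture group 1.
ids = [m.group(1) for m in PATTERN.finditer(text)]
print(ids)
```

Unlike a plain leftmost search, the second lookahead makes a line such as AB0005 foo AB0005A yield AB0005A, because the plain ID is rejected when a trailing-letter ID appears later on the same line.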

Upvotes: 1

Saleem

Reputation: 8978

In R 3.1.3, I would detect the IDs this way:

grepl("(?<!\\.)[A-Z0-9]+?(?=\\s)", subject, perl=TRUE);

Based on the input you posted in your question, the output will be:

INPUT

foo
AB0001
AB0002 foo
foo AB0003
foo AB0004A AB0004.1
AB0005.1 foo AB0005A bar AB0005

-

OUTPUT

  • AB0001
  • AB0002
  • AB0003
  • AB0004A
  • AB0005A

Upvotes: 0
