salient

Reputation: 2486

Optional matching in regex

I'm trying to match these input strings into three capture groups (Regex101 link):

    | input string  | x  | y   | z  |
------------------------------------
  I | a             | a  |     |    |
 II | a - b         | a  | b   |    |
III | a - b-c       | a  | b-c |    |
 IV | a - b, 12     | a  | b   | 12 |
  V | a - 12        | a  |     | 12 |
 VI | 12            |    |     | 12 |

So the anatomy of the input strings is as follows:

  • optional first part with free text up until a hyphen with surrounding whitespace (" - ") or the end of the input string
  • optional second part with any characters after the first hyphen with surrounding whitespace, up until a comma or the end of the input string
  • optionally, exactly two digits at the end

I've tried a plethora of different solutions; this is my current attempt:

^(?P<x>.*)(?:-)(?P<y>.*)(?<!\d)(?P<z>\d{0,2})(?!\d)$

It handles scenarios II, IV and V OK (must do some trimming of white space as well), however:
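
One way to see where the attempt falls short is to run it against all six inputs. Below is a minimal Python sketch; the `INPUTS` list and the `run_attempt` helper are illustrative names, and only the pattern itself comes from the attempt above.

```python
import re

# The attempt from above, unchanged.
ATTEMPT = re.compile(r"^(?P<x>.*)(?:-)(?P<y>.*)(?<!\d)(?P<z>\d{0,2})(?!\d)$")

# The six inputs from the table, in the order of rows I to VI.
INPUTS = ["a", "a - b", "a - b-c", "a - b, 12", "a - 12", "12"]


def run_attempt(text):
    """Return the (x, y, z) groups, or None when the pattern does not match."""
    m = ATTEMPT.match(text)
    return m.group("x", "y", "z") if m else None


for text in INPUTS:
    print(f"{text!r:12} -> {run_attempt(text)}")
```

Because the hyphen is mandatory in this pattern, inputs without one (rows I and VI) cannot match at all, and the greedy `.*` in `x` runs up to the last hyphen, so row III is split in the wrong place.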

Upvotes: 4

Views: 1353

Answers (3)

Jan

Reputation: 43169

Interesting question. This is the solution I came up with:

^
    (?:(?P<x>\D*?)(?=(?:\ -\ |$)))?
    (?:.*?(?<=\ -\ )(?P<y>[^\d,]+)(?=,|$))?
    (?:.*?(?P<z>\d{2}$))?
$

See a demo on regex101.com (and mind the verbose [aka x] and multiline [aka m] modifiers).


More verbose:

^                       # start of the line
    (?:                 # non-capturing group
        (?P<x>\D*?)     # non-digits, lazily ...
        (?=\ -\ |$)     # ... up until either " - " or the end of the line
    )?                  # optional
    (?:
        .*?             # match anything, lazily
        (?<=\ -\ )      # positive lookbehind: preceded by " - "
        (?P<y>[^\d,]+)  # one or more characters that are neither digit nor comma
        (?=,|$)         # up until a comma or the end of the line
    )?
    (?:
        .*?
        (?P<z>\d{2}$)   # two digits at the end
    )?
$
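
For anyone who wants to try this outside regex101, here is a minimal Python sketch. It assumes Python's `re` module, where the x and m modifiers correspond to `re.VERBOSE` and `re.MULTILINE`; the `split_parts` helper is a made-up name for illustration.

```python
import re

# The compact pattern from above. re.VERBOSE ignores unescaped whitespace
# (hence the "\ " for literal spaces); re.MULTILINE lets ^ and $ match at
# line boundaries.
PATTERN = re.compile(
    r"""
    ^
        (?:(?P<x>\D*?)(?=(?:\ -\ |$)))?
        (?:.*?(?<=\ -\ )(?P<y>[^\d,]+)(?=,|$))?
        (?:.*?(?P<z>\d{2}$))?
    $
    """,
    re.VERBOSE | re.MULTILINE,
)


def split_parts(text):
    """Return (x, y, z); groups that took no part in the match are None."""
    m = PATTERN.match(text)
    return m.group("x", "y", "z") if m else None


for text in ["a", "a - b", "a - b-c", "a - b, 12", "a - 12", "12"]:
    print(f"{text!r:12} -> {split_parts(text)}")
```

Note that `x` only accepts non-digits, so a first part that contains a digit would not be captured by this pattern.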

Upvotes: 2

Tomalak

Reputation: 338208

This seems to do reasonably well:

^(?:(.*?)(?: - |$))?(?:(.*?)(?:, |$))?(\d\d$)?$

The values of interest will be in groups 1, 2 and 3, respectively.

The only catch is that "two digits" will be

  • in group 2 for case V and
  • in group 1 for case VI,

the other groups being empty in those cases.

This is because "two digits" happily matches the "free text until the delimiter, or the string ends" rule.
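
A quick way to observe this is the following Python sketch. The `plain_groups` helper is a hypothetical name, and it normalises empty groups to `None` so that "empty" and "did not participate" look the same.

```python
import re

# The first pattern from this answer, unchanged.
PLAIN = re.compile(r"^(?:(.*?)(?: - |$))?(?:(.*?)(?:, |$))?(\d\d$)?$")


def plain_groups(text):
    """Return groups 1 to 3, with empty or unmatched groups shown as None."""
    m = PLAIN.match(text)
    return tuple(g or None for g in m.groups()) if m else None


print(plain_groups("a - b, 12"))  # digits after ", " reach group 3
print(plain_groups("a - 12"))     # case V: digits are taken by group 2
print(plain_groups("12"))         # case VI: digits are taken by group 1
```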

You could use negative look-aheads to force the two digits into the last group, but that is only correct if "two digits" is not a legal value for groups 1 and 2. In any case, it makes the expression unwieldy:

^(?:((?!\d\d$).*?)(?: - |$))?(?:((?!\d\d$).*?)(?:, |$))?(\d\d$)?$

Breakdown:

^                    # string starts
(?:(.*?)(?: - |$))?  # any text, reluctantly, and " - " or the string ends
(?:(.*?)(?:, |$))?   # any text, reluctantly, and ", " or the string ends
(\d\d$)?             # two digits and the string ends
$                    # string ends
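
To check the look-ahead variant against the six inputs from the question, a small Python sketch could look like this. `GUARDED`, `EXPECTED` and `guarded_groups` are illustrative names, and empty groups are again normalised to `None`.

```python
import re

# The look-ahead variant from this answer: (?!\d\d$) keeps groups 1 and 2
# from starting at a position where only two digits remain.
GUARDED = re.compile(
    r"^(?:((?!\d\d$).*?)(?: - |$))?(?:((?!\d\d$).*?)(?:, |$))?(\d\d$)?$"
)

# Expected (x, y, z) per input, taken from the table in the question.
EXPECTED = {
    "a": ("a", None, None),
    "a - b": ("a", "b", None),
    "a - b-c": ("a", "b-c", None),
    "a - b, 12": ("a", "b", "12"),
    "a - 12": ("a", None, "12"),
    "12": (None, None, "12"),
}


def guarded_groups(text):
    """Return groups 1 to 3, with empty or unmatched groups shown as None."""
    m = GUARDED.match(text)
    return tuple(g or None for g in m.groups()) if m else None


for text, want in EXPECTED.items():
    got = guarded_groups(text)
    print(f"{text!r:12} -> {got} {'ok' if got == want else 'MISMATCH'}")
```

The price is the one mentioned above: a second part that really is just two digits can no longer end up in group 2.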

Upvotes: 5

Aran-Fey

Reputation: 43166

There are less verbose regexes that achieve this task, but this one encodes the logic in a pretty straightforward fashion:

^(?P<x>(?!\d\d$)(?:(?! - ).)*)?(?: - (?P<y>(?!\d\d$)[^,\n]*)?(?:, )?)?(?P<z>\d\d)?$

^                   # assert start of string/line
(?P<x>              # capture in group "x"
    (?!\d\d$)       # if the whole string is just two digits, don't capture them in group x
    (?:             # as long as...
        (?! - )     # ...we don't come across the text " - "...
        .           # ...consume the next character
    )*
)?                  # make group x optional
(?:                 # if possible...
     -              # consume the " - " separator
    (?P<y>          # then capture group "y"
        (?!\d\d$)   # again, only if this isn't two digits which belong in group z
        [^,\n]*     # consume everything up to a comma
    )?              # group y is also optional
    (?:, )?         # consume the ", " separator, if present
)?
(?P<z>              # finally, capture in group "z"...
    \d\d            # ...two digits...
)?                  # ...if present
$                   # assert end of string
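
As a rough check, here is a short Python sketch that runs the one-line pattern over the six inputs from the question. The `CASES` table and the `parse` helper are illustrative names; the one-line form is used because the spaces in the breakdown above would be ignored under a verbose flag unless escaped.

```python
import re

# The one-line pattern from this answer, split across string literals only
# for readability (adjacent raw strings are concatenated unchanged).
PATTERN = re.compile(
    r"^(?P<x>(?!\d\d$)(?:(?! - ).)*)?"
    r"(?: - (?P<y>(?!\d\d$)[^,\n]*)?(?:, )?)?"
    r"(?P<z>\d\d)?$"
)

# Expected (x, y, z) per input, taken from the table in the question.
CASES = {
    "a": ("a", None, None),
    "a - b": ("a", "b", None),
    "a - b-c": ("a", "b-c", None),
    "a - b, 12": ("a", "b", "12"),
    "a - 12": ("a", None, "12"),
    "12": (None, None, "12"),
}


def parse(text):
    """Return (x, y, z); groups that took no part in the match are None."""
    m = PATTERN.match(text)
    return m.group("x", "y", "z") if m else None


for text, want in CASES.items():
    got = parse(text)
    print(f"{text!r:12} -> {got} {'ok' if got == want else 'MISMATCH'}")
```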

Upvotes: 3
