oepix

Reputation: 214

RegEx optional group with optional sub-group

I have a set of strings with fairly inconsistent naming, though they should be structured enough to be divided into groups.

Here's an excerpt:

test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test  1990-99 q1 .txt
test 1981 w15 .csv
test test  w15.csv

I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:

(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$

It works except for the optional group matching the years (an interval of years, a single year, or no year at all). What am I missing to make the pattern work?

Here's also a link to RegEx101 for testing:

https://regex101.com/r/wG3aM3/817

Upvotes: 0

Views: 4370

Answers (1)

The fourth bird

Reputation: 163352

You could make the pattern a bit more specific and make the content of the year group optional:

^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
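
As a side note on why the pattern from the question loses the years: its first group `(.*)` is greedy, and since the year group is optional, the engine can let `(.*)` swallow the years and still match. The Python sketch below only illustrates that behaviour; the variable names are made up for the example.

```python
import re

# Pattern copied from the question; the variable names are illustrative only.
ORIGINAL = re.compile(r"(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$")

m = ORIGINAL.search("test 1970-2020 w15.csv")

# The greedy (.*) backtracks only as far as the last place where the rest
# of the pattern can match, which is right before " w15.csv".
print(repr(m.group(1)))  # the years end up inside group 1
print(repr(m.group(2)))  # the optional year group did not participate
```

Here group 1 is `'test 1970-2020'` and group 2 is `None`. Making the first group non-greedy with `(.*?)` lets the year group claim the digits first.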

Explanation

  • ^ Start of string
  • (.*?) Capture group 1, match any char except a newline 0+ times, non-greedy
  • \s+ Match 1+ whitespace chars
  • ( Capture group 2
    • (?: Non capture group
      • \d{4}(?:-(?:\d{4}|\d{2}))? Match 4 digits and optionally - and 2 or 4 digits
    • )? Close non capture group and make the year optional
  • ) Close group 2
  • \s+ Match 1+ whitespace chars
  • ([wq][0-9]+) Capture group 3, match either w or q and 1+ digits 0-9
  • \s* Match 0+ whitespace chars
  • (\.\w+) Capture group 4, match a dot and 1+ word characters
  • $ End of string
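
To check the pattern against the strings from the question, a minimal Python sketch could look like this. `PATTERN`, `SAMPLES` and `parse_name` are names chosen for the example, not part of any library.

```python
import re

# Pattern from this answer; PATTERN, SAMPLES and parse_name are illustrative names.
PATTERN = re.compile(
    r"^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$"
)

# Sample strings from the question (note the double spaces in two of them).
SAMPLES = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test  1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test  w15.csv",
]


def parse_name(filename):
    """Return (name, years, suffix, extension), or None if there is no match."""
    match = PATTERN.match(filename)
    return match.groups() if match else None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "->", parse_name(sample))
```

For a string without a year, group 2 is an empty string rather than `None`, because the group itself always participates and only its content is optional. Since both `\s+` parts must match, a name without a year needs at least two whitespace characters before the suffix: `test test  w15.csv` matches, but `test w15.csv` with a single space does not.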


Note that \s could also match a newline.
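
If that matters for your input, one option is to swap `\s` for a class that excludes line breaks, such as `[^\S\r\n]`. The variant below is only a sketch of that idea in Python, not something tested against your full data set; `WITH_S` and `NO_NEWLINE` are names picked for the example.

```python
import re

# The pattern from this answer, unchanged.
WITH_S = re.compile(
    r"^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$"
)

# Illustrative variant: [^\S\r\n] means "whitespace, but not CR or LF".
NO_NEWLINE = re.compile(
    r"^(.*?)[^\S\r\n]+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)"
    r"[^\S\r\n]+([wq][0-9]+)[^\S\r\n]*(\.\w+)$"
)

text = "test\n1981 w15.csv"  # name and year separated by a line break

print(WITH_S.match(text) is not None)      # \s+ accepts the newline
print(NO_NEWLINE.match(text) is not None)  # the variant rejects it
```

The first check prints `True` and the second `False`, while a single-line string such as `test 1981 w15 .csv` still matches the variant.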

Upvotes: 6
