Reputation: 214
I have a set of strings with fairly inconsistent naming, but they should be structured enough to be divided into groups.
Here's an excerpt:
test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test 1990-99 q1 .txt
test 1981 w15 .csv
test test w15.csv
I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:
(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$
It works except for the optional group matching the years (a range of years, a single year, or no year at all). What am I missing to make the pattern work?
Here's also a link to RegEx101 for testing:
https://regex101.com/r/wG3aM3/817
Upvotes: 0
Views: 4370
Reputation: 163352
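In your pattern the first group (.*) is greedy. Because the year group is optional, (.*) consumes the year as well and the optional group matches nothing. You could make the first group non-greedy, make the pattern a bit more specific, and make the content of the year group optional: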
^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
Explanation
^ - Start of string
(.*?) - Capture group 1: match any char except a newline 0+ times, non-greedy
\s+ - Match 1+ whitespace chars
( - Capture group 2
(?: - Non-capturing group
\d{4}(?:-(?:\d{4}|\d{2}))? - Match 4 digits, optionally followed by a hyphen and either 4 or 2 digits
)? - Close the non-capturing group and make the year optional
) - Close group 2
\s+ - Match 1+ whitespace chars
([wq][0-9]+) - Capture group 3: match either w or q, followed by 1+ digits 0-9
\s* - Match 0+ whitespace chars
(\.\w+) - Capture group 4: match a dot and 1+ word characters
$ - End of string
Note that \s could also match a newline.
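As a rough check, the pattern can be run against the strings from the question. The sketch below uses Python's re module (an assumption, since the question does not name a language). ANSWER is the pattern above, unchanged. VARIANT is a hypothetical variation that is not part of the answer: it moves the whitespace after the year inside the optional group. The reason is that ANSWER has \s+ on both sides of the possibly empty year group, so a name without a year needs at least two whitespace characters before the suffix. The helper name parse is also made up for illustration.

```python
import re

# Pattern from the answer, unchanged.
ANSWER = re.compile(
    r"^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$"
)

# Hypothetical variant (not from the answer): the whitespace that follows the
# year sits inside the optional group, so a single space before the suffix
# is enough when there is no year.
VARIANT = re.compile(
    r"^(.*?)\s+(?:(\d{4}(?:-(?:\d{4}|\d{2}))?)\s+)?([wq][0-9]+)\s*(\.\w+)$"
)

# File names taken from the question.
SAMPLES = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test 1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test w15.csv",
]


def parse(name, pattern=VARIANT):
    """Return (test_name, year, suffix, file_type), or None if there is no match."""
    m = pattern.match(name)
    if m is None:
        return None
    test_name, year, suffix, file_type = m.groups()
    # Normalise a missing year (empty string or unmatched group) to None.
    return test_name, year or None, suffix, file_type


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "->", parse(sample))
```

With the names exactly as shown in the question (single spaces), parse("test test w15.csv", ANSWER) returns None, while the variant returns the name with the year set to None. If the real file names have two spaces where the year is missing, ANSWER handles them as well.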
Upvotes: 6