Reputation: 175
I'm trying to write a regular expression that handles inputs like the ones below and extracts the month and year into two groups (start and end), like this:
From August 2017 - September 2018 (output: {August 2017},{September 2018})
From August to September 2018 (output: {August},{September 2018})
July 2009 - August 2019 (output: {July 2009},{August 2019})
De Aout 2019 a July 2020 (output: {Aout 2019},{July 2020})
De Juillet a Aout 2020 (output: {Juillet},{Aout 2020})
Juillet - Aout 2019 (output: {Juillet},{Aout 2019})
Juillet a Aout 2019 (output: {Juillet},{Aout 2019})
I found this regex here which does a good job (regex101 link):
(?P<fmonth>\w+.\d*)\s+\D+\s+(?P<smonth>\D+.\d+)
But the problem with it is that it does not handle these 2 cases where there is no year in the first part:
De Juillet a Aout 2020
From August to September 2018
I think it's missing a part to exclude/ignore specific words like "From" and "De".
Any ideas or solutions?
Upvotes: 2
Views: 212
Reputation: 626845
Note that \D+ is a very generic pattern: it matches any 1+ non-digit symbols, e.g. "August to" in "From August to September 2018". Also, \w matches letters, digits and underscores. It may be more appropriate to match only letters when you need to match month names, and for that, all you need is to subtract \d and _ from it ([^\W\d_]).
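To make the difference concrete, here is a small Python sketch comparing the three patterns. The sample strings and variable names are illustrative only, not part of the original question.

```python
import re

sample = "From August to September 2018"

# \D+ only stops at a digit, so it swallows words and spaces alike.
non_digit_run = re.match(r"\D+", sample).group()

# Illustrative token mixing letters, an underscore and digits.
token = "Aout_2019"

# \w+ accepts letters, digits and underscores in a single run.
word_chars = re.findall(r"\w+", token)

# [^\W\d_]+ is "\w minus digits and underscore", i.e. letters only.
letters_only = re.findall(r"[^\W\d_]+", token)

print(repr(non_digit_run))  # 'From August to September '
print(word_chars)           # ['Aout_2019']
print(letters_only)         # ['Aout']
```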
You may use a bit more precise regex:
(?P<fmonth>[^\W\d_]+(?:\W+\d+)?)\s+(?:to|a|-)\s+(?P<smonth>[^\W\d_]+\W+\d+)
See the regex demo
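As a quick check, the pattern above can be run against the sample inputs from the question. This is a minimal sketch in Python; the helper name extract_range and the constant DATE_RANGE are hypothetical names introduced here for illustration.

```python
import re

# The pattern from the answer, compiled once for reuse.
DATE_RANGE = re.compile(
    r"(?P<fmonth>[^\W\d_]+(?:\W+\d+)?)\s+(?:to|a|-)\s+(?P<smonth>[^\W\d_]+\W+\d+)"
)


def extract_range(text):
    """Return (start, end) for the first match in text, or None.

    extract_range is a hypothetical helper, not part of any library.
    """
    match = DATE_RANGE.search(text)
    if match is None:
        return None
    return match.group("fmonth"), match.group("smonth")


# Sample inputs taken from the question.
samples = [
    "From August 2017 - September 2018",
    "From August to September 2018",
    "July 2009 - August 2019",
    "De Aout 2019 a July 2020",
    "De Juillet a Aout 2020",
    "Juillet - Aout 2019",
    "Juillet a Aout 2019",
]

for sample in samples:
    print(sample, "->", extract_range(sample))
```

Leading words such as "From" and "De" are skipped because search() keeps scanning until the separator (to, a or -) follows a month. Matching is case-sensitive, so the separator alternatives only match in lowercase.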
Details
(?P<fmonth>[^\W\d_]+(?:\W+\d+)?) - fmonth group: 1+ letters and an optional sequence of 1+ non-word chars followed by 1+ digits
\s+ - 1+ whitespaces
(?:to|a|-) - to, a or -
\s+ - 1+ whitespaces
(?P<smonth>[^\W\d_]+\W+\d+) - smonth group: 1+ letters, 1+ non-word chars, 1+ digits
Upvotes: 2