Reputation: 612
I have a few documents that I'm extracting dates from, using this regex:
query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril
|[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary
|[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept
|[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
OR
query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|
[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|
[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|
[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
The only difference between the two is that one has the |'s at the beginning of each new line, and the other has them at the end of each line. These two match different things: with | at the end of the line I won't match May, but if it's at the beginning of a line I won't match January (assuming the day, year and spaces are otherwise correct; I literally just move the | position around, and what I was just matching I no longer match, and vice versa). Am I doing something wrong, is there a way around this, or is there a correct way to do this instead? Obviously the goal is to match both. If you want to try it out yourself, the cases I can easily replicate are '8 may 2018' and '25 january 2018'.
The rest of my code is just re.search(query, doc), which is what's failing to match.
Note: Python 3.6.8, regex==2018.1.10
Upvotes: 0
Views: 49
Reputation: 808
If you don't want to write the regex as one long single line, splitting it across multiple lines works well. As Ian mentioned, the following question covers how to build a long multi-line string in Python.
See: Pythonic way to create a long multi-line string
Upvotes: 0
Reputation: 333
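When you enter a string with triple quotes, every character between the quotes is recorded, including the \n at each line break. That newline becomes part of the neighbouring alternative, so the regex looks for a month with a newline attached (for example "\nMay" or "January\n") rather than the bare month name. This is what your query string really looks like: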
>>> query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|
... [mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|
... [nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|
... [oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
>>> query
'([0-9]{1,2})?\\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|\n [mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|\n [nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|\n [oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\\s{1,2}([0-9]{2,4})'
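A minimal sketch of the effect, using a cut-down pattern with only two months (the shortened pattern and the variable names are made up for illustration, they are not from the question). The line break after | ends up inside the second alternative, so that alternative only matches when a literal newline comes right before the month:

```python
import re

# Cut-down version of the question's pattern: only two months, with the
# alternation split across lines inside a triple-quoted string.
broken = r"""([0-9]{1,2})?\s{1,2}(april|
may)\s{1,2}([0-9]{2,4})"""

# The line break is stored in the string, so the second alternative
# is really "\nmay", not "may".
print(repr(broken))

april_match = re.search(broken, "8 april 2018")
may_match = re.search(broken, "8 may 2018")
newline_may_match = re.search(broken, "8 \nmay 2018")

print(april_match is not None)        # True
print(may_match is not None)          # False
print(newline_may_match is not None)  # True
```

Anything else that ends up between the triple quotes, such as indentation on the following lines, becomes part of the pattern in the same way.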
Avoid this by using \ line continuation to enter the string on multiple lines:
query = r"([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|" \
r"[sS]eptember|[oO]ctober|[jJ]anuary|[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|" \
r"[sS]ept|[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"
You can also keep your triple quotes and suppress the newline with \ (remember that you can't indent the lines below the first, because those spaces/tabs would be included in the string):
query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|\
[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|\
[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|\
[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
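One caveat, sketched below with a hypothetical two-month pattern: the backslash-newline trick only works in a regular (non-raw) string. In a raw string the backslash and the newline are both kept, so adding an r prefix to the triple-quoted form brings the problem back. Adjacent raw literals, shown here inside parentheses rather than with trailing backslashes, are joined with nothing in between:

```python
import re

# Regular triple-quoted string: backslash-newline is removed by Python.
# "\\s" is written with a doubled backslash because this is not a raw string.
plain = """(april|\
may)\\s[0-9]{4}"""

# Raw triple-quoted string: the backslash AND the newline both stay in the pattern.
raw = r"""(april|\
may)\s[0-9]{4}"""

# Adjacent raw literals inside parentheses: joined with nothing in between.
joined = (r"(april|"
          r"may)\s[0-9]{4}")

print(repr(plain))   # '(april|may)\\s[0-9]{4}'
print(repr(raw))     # '(april|\\\nmay)\\s[0-9]{4}'
print(plain == joined)                        # True
print(re.search(raw, "may 2018") is None)     # True: the raw form fails to match
print(re.search(joined, "may 2018").group())  # may 2018
```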
See also: Pythonic way to create a long multi-line string
Upvotes: 1
Reputation: 2937
As a few people have mentioned in the comments, you should try re.X (or re.VERBOSE). This lets you split the regex across multiple lines and also include comments:
query = r"""
# Day
([0-9]{1,2})?
\s{1,2}
# Long month (misspelled variants kept from the question's pattern)
([jJ]anurary|[fF]eburary|[jJ]anuary|[fF]ebruary|[mM]arch
|[aA]pril|[mM]ay|[jJ]une
|[jJ]uly|[aA]ugust|[sS]eptember
|[oO]ctober|[nN]ovember|[dD]ecember
# Short month
|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug
|[sS]ept?|[oO]ct|[nN]ov|[dD]ec)
\s{1,2}
# Year
([0-9]{2,4})"""
This can be useful for separating and documenting your regex into more manageable pieces.
Also, you probably want to compile your regex if you use it more than once. So you would use it like pattern = re.compile(query, re.X) or pattern = re.compile(query, re.VERBOSE).
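A short usage sketch, assuming the correctly spelled january and february alternatives from the question are part of the pattern. The month list is trimmed to keep it small, and the names DATE_PATTERN and find_date are illustrative only:

```python
import re

# Verbose pattern: whitespace and "#" comments inside it are ignored.
# The month list is shortened here to keep the sketch small.
DATE_PATTERN = re.compile(r"""
    ([0-9]{1,2})?          # optional day
    \s{1,2}
    ([jJ]anuary|[fF]ebruary|[mM]ay   # long month names
    |[jJ]an|[fF]eb)                  # short month names
    \s{1,2}
    ([0-9]{2,4})           # year
    """, re.VERBOSE)


def find_date(doc):
    """Return (day, month, year) for the first date found, or None."""
    match = DATE_PATTERN.search(doc)
    return match.groups() if match else None


print(find_date("signed on 8 may 2018"))       # ('8', 'may', '2018')
print(find_date("dated 25 january 2018 in"))   # ('25', 'january', '2018')
```

Because whitespace is ignored in verbose mode, the pattern can be indented freely, which is not possible with the plain triple-quoted form.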
Upvotes: 1