Reputation: 2427
I have a Python script in which I'm trying to parse a string of the form:
one[two=three].four
Each word should be in its own capture group. The punctuation should not be captured.
Additionally, each part of the string is optional, and the part delimited by brackets can be repeated. So the above is the most complete example, but all of the following should also be valid matches:
one
.four
one[two=three][five=six]
[two=three]
[two].four
[two][five]
[]
In the case that one of the words is not present, instead of failing to capture, I'd like to capture a string of length 0.
The regex that I'm using is as follows:
pattern = re.compile(
    r"""
    ^                      # Assert start of string
    (?P<cap1>              # Start a new group for "one"
        [a-z]*
    )
    (?:                    # Start a group for "two" and "three"
        \[                 # Match the "["
        (?P<cap_2>         # Start a group for "two"
            [a-z]*
        )
        =?                 # Delimit two/three with "="
        (?P<cap_3>         # Start a group for "three"
            [a-z]*
        )
        \]                 # Match the "]"
    )*                     # End the two-three group, allowing repeats
    \.?                    # Delimit three/four with "."
    (?P<cap_4>             # Begin a group for "four"
        [a-z]*
    )
    $                      # Assert end of string
    """, re.IGNORECASE | re.VERBOSE)
What I've tried to do in that regex is, instead of allowing 0 or 1 of a group by appending ? to the entire group, allow any number of characters inside the group by appending * to the character class. Therefore, the group is forced to match, but the captured string itself can have a length of 0.
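The difference between the two ways of making a group optional can be sketched with the standard-library re module. The patterns and the group name word below are made up for illustration and are not part of the regex above.

```python
import re

# Quantifier on the whole group: the group may not participate at all,
# in which case it reports None.
optional_group = re.compile(r"^(?P<word>[a-z]+)?\.?$")

# Quantifier inside the group: the group always participates,
# possibly capturing an empty string.
empty_capable = re.compile(r"^(?P<word>[a-z]*)\.?$")

print(optional_group.match(".").group("word"))       # None
print(repr(empty_capable.match(".").group("word")))  # ''
```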
The problem comes with the bracketed block. The package I'm using allows me to access all captures of a named group using match.captures(groupname). This way, I can access all matches for cap_2 using match.captures("cap_2"):
>>> pattern.match("one[two=three][five=six].four").captures("cap_2")
["two", "five"]
This works fine when the brackets are present. However, when they're not:
>>> pattern.match("one.four").captures("cap_2")
[]
Expected: [""]
I expect there to be at least an empty string present for cap_2 and cap_3. However, there's nothing.
This is because of the * I place after the two+three section of the regex in order to allow multiple of those groups: it allows that part of the regex to be skipped altogether.
Changing that * to + breaks the regex, as now it won't match the above example at all because it's trying to match the brackets. Adding a ? after each bracket means that cap1 and cap_2 are not delimited, and cap_3 ends up including what should be in cap_4.
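Both failure modes can be reproduced in a small sketch. It uses the standard-library re module, which has no captures() method, so a group that never took part in the match shows up as None rather than as an empty list. The helper build() is hypothetical and only swaps the quantifier on the bracketed block in a compact form of the pattern above.

```python
import re

def build(quantifier):
    # Compact, non-verbose form of the pattern from the question;
    # only the quantifier on the bracketed block varies.
    return re.compile(
        r"^(?P<cap1>[a-z]*)"
        r"(?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])" + quantifier +
        r"\.?(?P<cap_4>[a-z]*)$",
        re.IGNORECASE,
    )

star = build("*")
plus = build("+")

# With *, the bracketed block is skipped, so cap_2 never participates.
m = star.match("one.four")
print(m.group("cap1"), m.group("cap_2"), m.group("cap_4"))  # one None four

# With +, at least one bracket pair is required, so the match fails.
print(plus.match("one.four"))  # None
print(plus.match("one[two=three].four").group("cap_2"))  # two
```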
What's the solution here? How can I allow a group containing two capturing groups to be executed multiple times, but match only empty strings when the brackets are not present?
Upvotes: 2
Views: 1179
Reputation: 626709
You may solve the problem by replacing the * after the repeated group (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* with + and adding an alternative that contains a second occurrence of the groups cap_2 and cap_3 (note that the PyPI regex module supports multiple identically named groups in the same regex):
import regex as re  # PyPI "regex" module, not the standard-library "re"

s = 'one.four'
pattern = re.compile(
    r"""
    ^                          # Assert start of string
    (?P<cap1>                  # Start a new group for "one"
        [a-z]*
    )
    (?:
        (?:                    # Start a group for "two" and "three"
            \[                 # Match the "["
            (?P<cap_2>         # Start a group for "two"
                [a-z]*
            )
            =?                 # Delimit two/three with "="
            (?P<cap_3>         # Start a group for "three"
                [a-z]*
            )
            \]                 # Match the "]"
        )+                     # End the two-three group, allowing repeats
        |                      # No brackets: fall back to empty captures
        (?P<cap_2>)(?P<cap_3>)
    )
    \.?                        # Delimit three/four with "."
    (?P<cap_4>                 # Begin a group for "four"
        [a-z]*
    )
    $                          # Assert end of string
    """, re.IGNORECASE | re.VERBOSE)
print(pattern.match(s).captures("cap_2"))
# => ['']
The thing is, the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* part always matches, since it can match an empty string, so if you just add the alternative without changing the quantifier, the alternative is never tried and the expected result won't be achieved. With +, if there are no [...] parts, the first alternative fails and the second cap_2 and cap_3 groups, which have empty patterns, always match and each capture an empty string.
Upvotes: 1
Reputation: 2401
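A group can capture the empty string through an empty group or an empty alternative of the | operator: () or (not empty|)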
Combined and applied to your case, that would look like this (simplified):
((?:\[stuff inside the brackets\])+|)
The outermost group captures the whole bracket construct (e.g. [two][three]) if it's present, or the empty string otherwise. Notice that the left part of the | operator now has to match at least once (+).
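One way this idea might be carried through with only the standard-library re module is a two-pass parse: the outer pattern captures the whole bracket run (or the empty string), and a second pattern splits that run into pairs. The names outer, inner, parse, head, brackets and tail are made up for this sketch, and falling back to a single pair of empty strings when no brackets are present is an assumption based on what the question asks for.

```python
import re

# Outer pass: capture the whole bracket run, or "" when there is none.
outer = re.compile(
    r"^(?P<head>[a-z]*)"
    r"(?P<brackets>(?:\[[a-z]*=?[a-z]*\])+|)"
    r"\.?(?P<tail>[a-z]*)$",
    re.IGNORECASE,
)

# Inner pass: split the bracket run into (left, right) pairs.
inner = re.compile(r"\[([a-z]*)=?([a-z]*)\]", re.IGNORECASE)

def parse(text):
    m = outer.match(text)
    if m is None:
        return None
    # Assumption: report one pair of empty strings when no brackets exist.
    pairs = inner.findall(m.group("brackets")) or [("", "")]
    return m.group("head"), pairs, m.group("tail")

print(parse("one[two=three][five=six].four"))
# ('one', [('two', 'three'), ('five', 'six')], 'four')
print(parse("one.four"))
# ('one', [('', '')], 'four')
```

The trade-off is a second regex pass in exchange for not needing captures() or duplicate group names.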
Upvotes: 0