Reputation: 281
I am learning the ropes with regular expressions in Python. I have the code below:
import re
test = '"(Z101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)
It returns:
['Z101', 'Z104']
However, when I change 'Z101' to 'YZ101':
import re
test = '"(YZ101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)
It returns:
['Z102', 'Z104']
The purpose is to extract strings consisting of X, Y or Z followed by exactly three digits. Therefore, the desired output for the first snippet would be:
['Z101', 'Z102', 'Z104']
How can I fix the pattern to get the correct output?
Upvotes: 2
Views: 657
Reputation: 626691
The left- and right-hand boundary patterns ([\(\+] and [\)\+]) consume the text they match, so consecutive matches are not detected: the + that closes one match cannot also open the next one.
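To make the consumption visible, here is a small sketch that prints the full text of each match (delimiters included) for the first test string from the question. The variable names are only illustrative.

```python
import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
consuming = re.compile(r"[(+]([XYZ]\d\d\d)[)+]")

# group(0) is everything a match consumed, including both delimiters.
consumed = [m.group(0) for m in consuming.finditer(test)]
print(consumed)  # ['(Z101+', '+Z104)']
```

The '+' after Z101 is part of the first match, so the scan resumes at 'Z102', which no longer has a '(' or '+' available in front of it.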
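You can solve the problem using lookarounds, which check the surrounding characters without consuming them:
r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])"
or, more compactly, without the capturing group:
r"(?<=[(+])[XYZ]\d{3}(?=[)+])"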
Details
(?<=[(+]) - a positive lookbehind that matches a location that is immediately preceded by ( or +
[XYZ] - X, Y or Z
\d{3} - three digits
(?=[)+]) - a positive lookahead that makes sure there is ) or + immediately to the right of the current location.
Note that the word boundary, \b, can solve the issue in some situations, and it might help you here, too.
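As a quick check, the lookaround pattern can be run against both test strings from the question. This is a minimal sketch; the names first and second are only illustrative.

```python
import re

pattern = re.compile(r"(?<=[(+])[XYZ]\d{3}(?=[)+])")

# Lookarounds assert the delimiters without consuming them,
# so a single '+' can serve two neighbouring matches.
first = pattern.findall('"(Z101+Z102+Z1034+Z104)/4"')
second = pattern.findall('"(YZ101+Z102+Z1034+Z104)/4"')

print(first)   # ['Z101', 'Z102', 'Z104']
print(second)  # ['Z102', 'Z104']
```

In the second string, Z101 is preceded by Y rather than by ( or +, so the lookbehind rejects it. Whether that is the desired behaviour depends on how a prefixed code such as YZ101 should be treated.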
Upvotes: 3
Reputation: 587
Your pattern is looking for:
1. one '(' or '+'
2. one of 'X', 'Y' or 'Z'
3. three digits
4. one ')' or '+'
It's not selecting 'Z101' because, once you add 'Y', that substring is no longer immediately preceded by '(' or '+'.
One option would be to leave parts 1 and 4 (the boundary character classes) out of the pattern. That pattern would be r'[XYZ]\d\d\d'. On this example it finds 'Z101', 'Z102' and 'Z104', but it also returns 'Z103' from inside 'Z1034', so depending on your data it can create a different problem.
Another option would be to allow for an optional prefix character with '?'. Used as a quantifier, '?' means 'zero or one' (it can also modify other quantifiers, but that's a different topic). To do that, your pattern would be r"[(+][XYZ]?([XYZ]\d\d\d)[)+]". Note that the boundary characters are still consumed here, so a code that shares a '+' with the previous match (such as 'Z102') is still skipped.
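The sketch below compares the optional-prefix pattern as written with a variant in which the trailing boundary becomes a lookahead. The variant is an assumption added for illustration and is not part of the answer above; it simply stops the closing '+' from being consumed.

```python
import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'

# Optional prefix, both delimiters consumed (pattern from the answer).
consuming = re.compile(r"[(+][XYZ]?([XYZ]\d\d\d)[)+]")

# Illustrative variant: trailing delimiter checked with a lookahead instead.
with_lookahead = re.compile(r"[(+][XYZ]?([XYZ]\d\d\d)(?=[)+])")

consuming_result = consuming.findall(test)
lookahead_result = with_lookahead.findall(test)

print(consuming_result)  # ['Z101', 'Z104']
print(lookahead_result)  # ['Z101', 'Z102', 'Z104']
```

Only the leading delimiter is consumed in the variant, and it is consumed at the start of each match, so the '+' left behind by the lookahead is still available to open the next one.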
Upvotes: 1
Reputation: 520878
Use re.findall with the pattern [XYZ]\d{3}\b:
import re
test = '"(YZ101+Z102+Z1034+Z104)/4"'
# \b requires a word boundary after the third digit, so 'Z103' in 'Z1034' is rejected
matches = re.findall(r'[XYZ]\d{3}\b', test)
print(matches)  # ['Z101', 'Z102', 'Z104']
Upvotes: 2