Reputation: 2612
I'm trying to use the ? quantifier to match a pattern only if it exists, but I can't get it working as I want. In the example below I'm trying to extract the pairs of digits following AZA and ZZZ, where ZZZ appears all the time but AZA is optional. When AZA is missing, I just want to return a ('', [zzz-value]) pair (empty string instead of the AZA value):
Input:
AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---
Desired output:
[(00,32), (06, 50), ('',32), (46,53)]
My attempt:
re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)
My output:
[('00', '32'), ('', '50'), ('', '32'), ('', '53')]
Upvotes: 0
Views: 134
Reputation: 174706
You don't need the DOTALL modifier:
>>> text = """AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---"""
>>> re.findall(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)', text)
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
[\S\s]* matches any character, whitespace or non-whitespace (line breaks included), zero or more times.
Why does your regex fail?
(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)
In DOTALL mode the dot also matches line breaks. Because (?:AZA:([0-9]*))? is optional and the .*? sits outside it, a match can start anywhere. After the first match, the next attempt starts right after ZZZ:32; the optional group matches nothing there, and the .*? then consumes every character up to the next ZZZ:([0-9]*), including the AZA:06 line. Moving the .*? inside the optional group means a match can start only at AZA: or at ZZZ:, so the digits following AZA: are captured whenever AZA: is present, and nothing is consumed unnecessarily.
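To see where the original pattern goes wrong, it can help to print what each match actually consumed. This is only a rough sketch on a shortened sample (the shortened input and the variable names are my own assumptions, not from the question); it compares the question's pattern with one that keeps the lazy filler inside the optional group.

```python
import re

# Shortened sample, assumed for illustration (not the full input from the question).
sample = (
    "AZA:00zx---\nZZZ:32fd---\ntestxfiler\n"
    "AZA:06x---\nZZZ:50----\nfdsfsk\n"
    "ZZZ:32zzz----\nfdsfsk\n"
    "AZA:46----\nZZZ:53---"
)

# Pattern from the question: the lazy .*? sits outside the optional group.
original = re.compile(r'(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', re.DOTALL)
# Lazy filler moved inside the optional group.
regrouped = re.compile(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)')

original_matches = list(original.finditer(sample))
regrouped_matches = list(regrouped.finditer(sample))

for m in original_matches:
    # repr() makes the consumed line breaks and AZA lines visible.
    print(repr(m.group(0)), m.groups())

print(regrouped.findall(sample))
```

The second match of the original pattern starts at fd--- and runs through the AZA:06 line, which is why its first group comes back empty.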
Upvotes: 1
Reputation: 26667
A regex of the form
(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*
would be helpful.
For example, with the input text in x:
>>> re.findall(r'(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*', x)
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
(?:AZA:(\d+)[^\n]*\n)? matches AZA: followed by digits (\d+), then anything other than \n ([^\n]*), then the line break itself. The ? quantifier at the end makes the entire group optional. The digits are captured in group 1.
(?:ZZZ:)(\d+)[^\n]* matches ZZZ: followed by digits (\d+) and anything other than \n. The digits are captured in group 2.
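One consequence of using [^\n]*\n is that the AZA line has to sit directly above the ZZZ line for it to be picked up. A small sketch of that behaviour follows; the two inputs and the variable names are hypothetical, made up only for this illustration, and the second pattern is included just for contrast with a filler that can cross lines.

```python
import re

# Line-based pattern: the optional group covers exactly one AZA line.
line_pair = re.compile(r'(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*')
# For contrast: a filler that is allowed to cross line breaks.
spanning = re.compile(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)')

# Hypothetical input: the AZA line is NOT directly above the ZZZ line.
gap = "AZA:11--\nother\nZZZ:22--"
# Hypothetical input: AZA directly above ZZZ, as in the question.
adjacent = "AZA:11--\nZZZ:22--"

print(line_pair.findall(adjacent))  # AZA is paired with ZZZ
print(line_pair.findall(gap))       # AZA is ignored, group 1 is ''
print(spanning.findall(gap))        # AZA is paired across the extra line
```

Which behaviour is right depends on whether an AZA line several lines above a ZZZ line should still count as belonging to it; the question's input does not say either way.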
What you missed
re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)
The .*? belongs inside the optional group, so that the whole of
(?:AZA:([0-9]*).*?)?
is optional, followed by \n. Changing your regex to
re.findall(r'(?:AZA:([0-9]*).*?)?\nZZZ:([0-9]*)', x)
gives the output
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
Upvotes: 1
Reputation: 67968
(?:AZA:(\d+).*?)?ZZZ:(\d+)
See demo
import re
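# Raw-string pattern (the ur'' prefix is not valid in Python 3); re.DOTALL lets . match line breaks too.
p = re.compile(r'(?:AZA:(\d+).*?)?ZZZ:(\d+)', re.DOTALL)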
test_str = u"AZA:00zx---\nZZZ:32fd---\ntestxfiler\ngsdkfklsd\nfdsfsk\nAZA:06x---\nZZZ:50----\ngsdkfklsd\ngsdkfklsd\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nZZZ:32zzz----\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nAZA:46----\nZZZ:53---"
re.findall(p, test_str)
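As a side note on the empty string the question asks for: re.findall reports an optional group that did not take part in the match as '', whereas a match object reports it as None unless a default is passed to groups(). A minimal sketch, assuming Python 3 and a shortened sample of my own:

```python
import re

p = re.compile(r'(?:AZA:(\d+).*?)?ZZZ:(\d+)', re.DOTALL)

# Shortened sample, assumed for illustration only.
sample = "AZA:00zx---\nZZZ:32fd---\nfdsfsk\nZZZ:32zzz----\nAZA:46----\nZZZ:53---"

# findall fills a non-participating group with ''.
pairs = p.findall(sample)

# Match objects use None for a non-participating group...
raw = [m.groups() for m in p.finditer(sample)]
# ...unless a default value is supplied.
filled = [m.groups(default='') for m in p.finditer(sample)]

print(pairs)
print(raw)
print(filled)
```

So findall already produces the ('', [zzz-value]) shape; the default argument only matters when iterating over match objects instead.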
Upvotes: 3