confused00

Reputation: 2612

Regex optional matches pattern

I'm trying to use the ? quantifier to match a pattern only if it exists, but I can't get it to work the way I want. In the example below I'm trying to extract the pairs of digits following AZA and ZZZ, where ZZZ is always present but AZA is optional. When AZA is missing, I just want to return a ('', [zzz-value]) pair (an empty string instead of the AZA value):

Input:

AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---

Desired output:

[(00,32), (06, 50), ('',32), (46,53)]

My attempt:

re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)

My output:

[('00', '32'), ('', '50'), ('', '32'), ('', '53')]

Upvotes: 0

Views: 134

Answers (3)

Avinash Raj

Reputation: 174706

You don't need the DOTALL modifier:

>>> text = """AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---"""
>>> re.findall(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)', text)
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]


[\S\s]*? matches any character, whitespace or non-whitespace (line breaks included), zero or more times, and as few times as possible.

Why does your regex fail?

(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)

In DOTALL mode, the dot also matches line breaks. Because (?:AZA:([0-9]*))? is optional, the engine simply skips it whenever AZA: is not at the current position, and the following .*? then consumes everything up to the next ZZZ:([0-9]*), including any AZA: line, without capturing it. Moving the .*? inside the optional group changes this: the lazy part can only consume text after an AZA: has actually matched, otherwise the match has to start directly at ZZZ:. The engine therefore advances until it reaches either AZA: (and captures the digits after it) or a bare ZZZ:, and no longer swallows an AZA: line unnoticed.
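To make the difference concrete, here is a small sketch that runs both patterns side by side. The sample string is an assumption, trimmed from the question's input so the result is easy to follow by eye.

```python
import re

# Shortened sample modelled on the question's input (an assumption made for
# brevity). The last ZZZ line has no AZA line in front of it.
sample = (
    "AZA:00zx---\n"
    "ZZZ:32fd---\n"
    "noise\n"
    "AZA:06x---\n"
    "ZZZ:50----\n"
    "noise\n"
    "ZZZ:32zzz----\n"
)

# Original attempt: .*? sits outside the optional group, so it can run
# straight over an "AZA:" line without capturing its digits.
original = re.findall(r'(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', sample, re.DOTALL)

# Fixed pattern: the lazy filler is only available after "AZA:" has matched.
fixed = re.findall(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)', sample)

print(original)  # [('00', '32'), ('', '50'), ('', '32')]
print(fixed)     # [('00', '32'), ('06', '50'), ('', '32')]
```

Only the first pair survives in the original attempt because that is the one place where a match happens to start exactly at AZA:.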

Upvotes: 1

nu11p01n73R

Reputation: 26667

A regex of the form

(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]* would be helpful.

For example

>>> re.findall(r'(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*', x)
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
  • (?:AZA:(\d+)[^\n]*\n)? matches AZA: followed by digits (\d+), then the rest of the line ([^\n]*) and the line break (\n). The ? at the end makes the entire group optional. The digits are captured in group 1.

  • (?:ZZZ:)(\d+)[^\n]* matches ZZZ: followed by digits (\d+) and the rest of the line. The digits are captured in group 2.
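As a rough illustration of how the two groups behave, the sketch below walks over the matches with re.finditer. The two-record sample string is an assumption, cut down from the question's input.

```python
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r'(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*')

# Hypothetical two-record sample: the second ZZZ line has no AZA line before it.
sample = "AZA:06x---\nZZZ:50----\nnoise\nZZZ:32zzz----\n"

# On a match object, a group that did not take part in the match is None.
groups = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
print(groups)  # [('06', '50'), (None, '32')]

# re.findall reports the same skipped group as an empty string instead.
print(pattern.findall(sample))  # [('06', '50'), ('', '32')]
```

Note that this pattern only pairs an AZA line with the ZZZ line directly below it, which is the layout used in the question.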

What you missed

re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)

The .*? should have been moved inside the optional group, so that the whole AZA part is optional:

(?:AZA:([0-9]*).*?)?

followed by \n.

Changing your regex to

re.findall('(?:AZA:([0-9]*).*?)?\nZZZ:([0-9]*)', x)

gives the output

[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]

Upvotes: 1

vks

Reputation: 67968

(?:AZA:(\d+).*?)?ZZZ:(\d+)


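import re

# Python 3: a raw string is already Unicode, so the Python 2 ur'' prefix is dropped.
p = re.compile(r'(?:AZA:(\d+).*?)?ZZZ:(\d+)', re.DOTALL)
test_str = "AZA:00zx---\nZZZ:32fd---\ntestxfiler\ngsdkfklsd\nfdsfsk\nAZA:06x---\nZZZ:50----\ngsdkfklsd\ngsdkfklsd\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nZZZ:32zzz----\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nAZA:46----\nZZZ:53---"

# One (AZA digits, ZZZ digits) tuple per match; the first item is '' when AZA is absent.
print(p.findall(test_str))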
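For completeness, a minimal sketch of how this pattern might be wrapped for reuse and what it returns on the question's input. The names PAIR_RE and extract_pairs are hypothetical, introduced only for this illustration.

```python
import re

# Same pattern as in this answer, compiled once (PAIR_RE is a hypothetical name).
PAIR_RE = re.compile(r'(?:AZA:(\d+).*?)?ZZZ:(\d+)', re.DOTALL)


def extract_pairs(text):
    """Return (aza, zzz) digit pairs.

    aza is '' when no AZA: occurs between the previous match and this ZZZ:.
    """
    return PAIR_RE.findall(text)


# The input from the question, one record line per element.
text = "\n".join([
    "AZA:00zx---", "ZZZ:32fd---", "testxfiler", "gsdkfklsd", "fdsfsk",
    "AZA:06x---", "ZZZ:50----", "gsdkfklsd", "gsdkfklsd", "fdsfsk",
    "fdsfsk", "gsdkfklsd", "fdsfsk", "ZZZ:32zzz----", "fdsfsk",
    "fdsfsk", "gsdkfklsd", "fdsfsk", "AZA:46----", "ZZZ:53---",
])

print(extract_pairs(text))
# [('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
```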

Upvotes: 3
