Xiaobo Li

Reputation: 69

How to force a regex to stop when it hits a given character and start matching again from there

import re
match = re.findall(r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)?',
                   'axxxbxd,axxbxxcd,axxxxxd,axcxxx')
print(match)

output: [('a', 'b', 'c', 'd'), ('a', '', 'c', '')]

I want output as below:

[('a','b','','d'),('a','b','c','d'),('a','','','d'),('a','','c','')]

Each tuple starts with 'a' and has 4 items, with one tuple for each comma-separated part of the string.

Upvotes: 1

Views: 342

Answers (2)

Wiktor Stribiżew

Reputation: 626728

If you want to obtain several matches from a delimited string, either split the string on the delimiters first and run your regex on each part, or replace each . with [^<YOUR_DELIMITING_CHARS>], a negated character class that cannot match across a delimiter (note that \, ^, ] and - must be escaped inside it). Also note that you can get rid of the redundancy in the pattern by using optional non-capturing groups.

Note that I assume that a, b and c are placeholders and that the real-life values can be either single-character or multi-character.

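import re
s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
r = r'(a)(?:.*?(b))?(?:.*?(c))?(d)?'
# Split on the comma first, then run the regex on each part
print([re.findall(r, x) for x in s.split(',')])
# Same idea, splitting on any non-word char
print([re.findall(r, x) for x in re.split(r'\W', s)])

# Both calls print one list of matches per part:
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]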
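The second option mentioned above (a negated character class instead of .) could look roughly like the sketch below. It assumes a comma is the only delimiter. The build_pattern helper is a hypothetical name used for illustration only; re.escape is applied so that \, ^, ] and - are safe inside the class.

```python
import re

def build_pattern(delimiters=','):
    # Hypothetical helper: re.escape takes care of \, ^, ] and -
    # so the delimiters can be dropped into a negated character class.
    not_delim = '[^' + re.escape(delimiters) + ']'
    return re.compile(
        r'(a)(?:{nd}*?(b))?(?:{nd}*?(c))?(d)?'.format(nd=not_delim)
    )

s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
# A single findall call over the whole string; no match can cross a comma
matches = build_pattern(',').findall(s)
print(matches)
# [('a', 'b', '', ''), ('a', 'b', 'c', 'd'), ('a', '', '', ''), ('a', '', 'c', '')]
```

With this variant there is no splitting step, so the result is a flat list of tuples rather than one list per part.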


If your delimiters are non-word chars, split with \W instead of a literal comma.

import re
s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
r = r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)?'
print([re.findall(r, x) for x in s.split(',')])
print([re.findall(r, x) for x in re.split(r'\W', s)])
# Both calls print:
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]


If the strings can contain line breaks, pass the re.DOTALL modifier to the re.findall calls.
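A small sketch of the difference the flag makes, using a made-up part that contains a line break (the sample string is an assumption for illustration, not from the question):

```python
import re

r = r'(a)(?:.*?(b))?(?:.*?(c))?(d)?'
segment = 'ax\nxb'  # hypothetical part containing a line break

# Without the flag, . stops at the line break, so b is never reached
without_flag = re.findall(r, segment)
# With re.DOTALL, . also matches the line break
with_flag = re.findall(r, segment, re.DOTALL)

print(without_flag)  # [('a', '', '', '')]
print(with_flag)     # [('a', 'b', '', '')]
```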

Pattern details

  • (a) - Group 1 capturing a
  • (?:.*?(b))? - an optional non-capturing group matching a sequence of:
    • .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    • (b) - Group 2: a b value
  • (?:.*?(c))? - an optional non-capturing group matching a sequence of:
    • .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    • (c) - Group 3: a c value
  • (d)? - Group 4 (optional): a d.
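Note that (d)? only captures a d that comes immediately after the previous match position, which is why the first and third tuples above end with an empty string. If the d should be found anywhere later in the same part, as in the desired output of the question, one possible variation is to wrap it the same way as b and c. This is a sketch under that assumption; the extract function and its delimiter parameter are hypothetical names.

```python
import re

# Same pattern, but d is also allowed to appear after arbitrary chars
PATTERN = re.compile(r'(a)(?:.*?(b))?(?:.*?(c))?(?:.*?(d))?')

def extract(s, delimiter=','):
    results = []
    for segment in s.split(delimiter):
        m = PATTERN.search(segment)
        if m:
            # Groups that did not participate are None; use '' instead
            results.append(tuple(g or '' for g in m.groups()))
    return results

result = extract('axxxbxd,axxbxxcd,axxxxxd,axcxxx')
print(result)
# [('a', 'b', '', 'd'), ('a', 'b', 'c', 'd'), ('a', '', '', 'd'), ('a', '', 'c', '')]
```

Using re.search per part gives one tuple per part, so the result is a flat list.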

Upvotes: 1

RomanPerekhrest

Reputation: 92854

Considering that the crucial sequence a... b... c... d should be matched in strict order, use a straightforward approach:

import re

s = 'axxxbxd,xxbxxcxxd,xxbxxxd|axcxxx'   # extended example
result = []
for seq in re.split(r'\W', s):           # split by non-word character
    # keep each char that occurs in the part, '' otherwise
    result.append([c if c in seq else '' for c in ('a','b','c','d')])

print(result)

The output:

[['a', 'b', '', 'd'], ['', 'b', 'c', 'd'], ['', 'b', '', 'd'], ['a', '', 'c', '']]
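The membership test (c in seq) reports a value wherever it occurs in the part, so it does not itself check the order. If the order has to be enforced, a possible sketch is to search for each value only after the position of the previous hit. The ordered_tokens function is a hypothetical helper, not part of the answer above.

```python
import re

def ordered_tokens(seq, tokens=('a', 'b', 'c', 'd')):
    found, pos = [], 0
    for tok in tokens:
        idx = seq.find(tok, pos)   # look only after the previous hit
        if idx == -1:
            found.append('')
        else:
            found.append(tok)
            pos = idx + len(tok)
    return found

s = 'axxxbxd,xxbxxcxxd,xxbxxxd|axcxxx'
ordered = [ordered_tokens(seq) for seq in re.split(r'\W', s)]
print(ordered)
# [['a', 'b', '', 'd'], ['', 'b', 'c', 'd'], ['', 'b', '', 'd'], ['a', '', 'c', '']]
```

For the sample string the result is the same as above; the two approaches differ only for parts where the values appear out of order, for example 'dcba'.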

Upvotes: 1
