O. Th. B.

Reputation: 1353

Regex for optional end-part of substring

Consider the following (highly simplified) string:

'a b a b c a b c a b c'

This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.

I seek a regular expression which can give me the following matches by the use of re.findall():

[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

The string above thus has 4 matches of 'a b c', with the first match as a special case since its 'c' is missing.

My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':

re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')

I get:

[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]

Clearly, it has just ignored the 'c'. When I use a non-optional capture for 'c', the first match skips ahead to the first 'c' and swallows the 'a' and 'b' of the second 'a b c' substring. This results in only 3 matches, which is wrong:

[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
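
The match spans make both failures visible. A minimal sketch using re.finditer() on the same string (the variable names are only illustrative):

```python
import re

s = 'a b a b c a b c a b c'

# Optional (c)?: the lazy '.*?' before it is satisfied with zero characters,
# and (c)? is allowed to match nothing, so every match stops right after 'b'.
optional_spans = [m.span() for m in re.finditer(r'(a).*?(b).*?(c)?', s)]
print(optional_spans)   # [(0, 3), (4, 7), (10, 13), (16, 19)]

# Mandatory (c): the second '.*?' has to expand until it reaches a 'c',
# so the first match runs from index 0 past the 'a b' of the next block.
mandatory_spans = [m.span() for m in re.finditer(r'(a).*?(b).*?(c)', s)]
print(mandatory_spans)  # [(0, 9), (10, 15), (16, 21)]
```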

I have tried several other techniques (for instance, '(?<=c)') to no avail.

Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.

I use Python 3.5.2 on Windows 7.

Upvotes: 1

Views: 1219

Answers (1)

Wiktor Stribiżew

Reputation: 627537

Since your a, b, and c are placeholders that may be single characters, character sequences, or anything else, you need a tempered greedy token to make sure the pattern does not overflow into the other matches in the same string. Since the c is optional, wrap it, together with the token that precedes it, in an optional non-capturing group, (?:...)?:

(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
   ^^^^^^^^^^^^^   ^^^ ^^^^^^^^^^^^^^    ^

See the regex demo

Details:

  • (a) - Group 1 capturing some a
  • (?:(?!a|b).)* - a tempered greedy token matching any char that does not start an a or b sequence
  • (b) - Group 2 capturing some b
  • (?: - start of an optional non-capturing group, repeated 1 or 0 times
    • (?:(?!a|b|c).)* - a tempered greedy token that matches any char other than a newline, as long as it does not start an a, b or c pattern
    • (c) - Group 3 capturing some c pattern
  • )? - end of the optional non-capturing group.
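
To illustrate what "overflow" means here, the following sketch compares a lazy dot with the tempered greedy token on a made-up string, 'a x a b c', that contains a stray a with no b of its own:

```python
import re

s = 'a x a b c'  # made-up input: the first 'a' has no 'b' of its own

# Lazy dot: happily runs across the second 'a' to reach the first 'b'.
lazy = re.search(r'(a).*?(b)', s)
print(lazy.span(), repr(lazy.group(0)))          # (0, 7) 'a x a b'

# Tempered greedy token: refuses to consume a char that starts 'a' or 'b',
# so the attempt at index 0 fails and the match starts at the second 'a'.
tempered = re.search(r'(a)(?:(?!a|b).)*(b)', s)
print(tempered.span(), repr(tempered.group(0)))  # (4, 7) 'a b'
```

The lookahead is checked before every single character, so this is slower than a plain .*? on long inputs, but it is what keeps each match inside its own block.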

To obtain the list of tuples you want, post-process the re.findall() result with a list comprehension that drops the empty third item:

import re

r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# re.findall() alone pads the unmatched group with an empty string:
# print(re.findall(r, s))
# [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
# Drop the empty third item to get the desired list:
print([(a, b, c) if c else (a, b) for a, b, c in re.findall(r, s)])
# [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

See the Python demo
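
For the real-world case described in the question (multi-character markers spread over several lines, extracted into named groups), the same pattern might be assembled as sketched below. The markers START, MID and END and the sample log are hypothetical stand-ins, re.DOTALL is an assumption needed only because the markers are said to be separated by newlines, and str.format() is used instead of f-strings so the code also runs on Python 3.5:

```python
import re

# Hypothetical markers standing in for a, b and c; real log markers will differ.
a, b, c = (re.escape(t) for t in ('START', 'MID', 'END'))

pattern = re.compile(
    r'(?P<a>{a})(?:(?!{a}|{b}).)*'
    r'(?P<b>{b})'
    r'(?:(?:(?!{a}|{b}|{c}).)*(?P<c>{c}))?'.format(a=a, b=b, c=c),
    re.DOTALL,  # assumption: let '.' cross the newlines between markers
)

# Made-up log: the first block lacks its END marker.
log = 'START x\nMID y\nSTART x\nMID y\nEND\nSTART\nMID\nEND\n'

# groupdict() reports an unmatched optional group as None; drop those entries.
records = [
    {name: value for name, value in m.groupdict().items() if value is not None}
    for m in pattern.finditer(log)
]
print(records)
```

With re.finditer() the unmatched group comes back as None rather than '', so a block without c can be told apart from one where c captured an empty string.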

Upvotes: 2
