Reputation: 1353
Consider the following (highly simplified) string:
'a b a b c a b c a b c'
This is a repeating pattern of 'a b c', except at the beginning, where the 'c' is missing.
I seek a regular expression that gives me the following matches with re.findall():
[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
The string above thus has 4 matches of 'a b c', although the first match is a special case since the 'c' is missing.
My simplest attempt captures 'a' and 'b' and uses an optional capture for 'c':
re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')
I get:
[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]
Clearly, it has just ignored the 'c'. When I use a non-optional capture for 'c', the search skips ahead prematurely and misses the 'a' and 'b' in the second 'a b c' substring. This results in 3 wrong matches:
[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
I have tried several other techniques (for instance, '(?<=c)') to no avail.
Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.
I use Python 3.5.2 on Windows 7.
Upvotes: 1
Views: 1219
Reputation: 627537
Since your a, b, and c are placeholders, and you cannot know whether they are single characters, character sequences, or anything else, you need a tempered greedy token to make sure the pattern does not overflow into the other matches in the same string. And since the c is optional, just wrap it in a (?:...)? optional non-capturing group:
(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
   ^^^^^^^^^^^^^   ^^^ ^^^^^^^^^^^^^^   ^
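To see what the tempered greedy token buys you compared with a lazy dot, here is a minimal sketch on a shortened version of the string. The variable names are only for illustration, and the patterns are cut down to a single `a ... c` pair to isolate the effect:

```python
import re

s = 'a b a b c'

# A lazy dot happily runs across a second 'a' to reach the first 'c'.
lazy = re.search(r'a.*?c', s).group()

# The tempered token refuses to consume a char where another 'a' starts,
# so the only way to reach 'c' is to begin the match at the second 'a'.
tempered = re.search(r'a(?:(?!a).)*c', s).group()

print(lazy)      # a b a b c
print(tempered)  # a b c
```

The lookahead is checked before every single character the token consumes, which is what keeps one match from spilling into the next.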
Details:
(a) - Group 1, capturing some a
(?:(?!a|b).)* - a tempered greedy token matching any char (other than a newline) that does not start an a or b sequence
(b) - Group 2, capturing some b
(?: - start of an optional non-capturing group, repeated 1 or 0 times
(?:(?!a|b|c).)* - a tempered greedy token matching any char (other than a newline) that does not start an a, b or c pattern
(c) - Group 3, capturing some c pattern
)? - end of the optional non-capturing group.

To obtain the tuple list you need, you have to build it yourself using a comprehension:
import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# Plain findall pads the unmatched optional group with an empty string:
# print(re.findall(r, s))
# That one is bad: [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
# Drop the empty third item when c did not take part in the match:
print([(a, b, c) if c else (a, b) for a, b, c in re.findall(r, s)])
# This one is good: [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
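Since you mention named groups and newlines in the real log file, here is a rough sketch of how the same pattern might be adapted. The markers BEGIN, STEP and DONE, the group names, the `build_pattern` helper and the sample `log` string are all made up for illustration; substitute your own. It assumes the markers are literal strings (hence `re.escape`) and adds `re.DOTALL`, because without it the dot in the tempered token will not cross a newline:

```python
import re

# Hypothetical multi-character markers standing in for a, b and c.
START, MIDDLE, END = 'BEGIN', 'STEP', 'DONE'

def build_pattern(start, middle, end):
    """Build the tempered pattern from literal marker strings."""
    template = (
        r'(?P<start>{a})(?:(?!{a}|{b}).)*'
        r'(?P<middle>{b})'
        r'(?:(?:(?!{a}|{b}|{c}).)*(?P<end>{c}))?'
    )
    source = template.format(
        a=re.escape(start), b=re.escape(middle), c=re.escape(end)
    )
    # DOTALL lets the dot cross newlines between markers.
    return re.compile(source, re.DOTALL)

# Made-up sample: the first record has no DONE marker.
log = 'BEGIN x\nSTEP y\nBEGIN\nSTEP z\nDONE\nBEGIN STEP DONE'

pattern = build_pattern(START, MIDDLE, END)

# With finditer, a group that did not take part in the match is None
# (findall would give ''), so filter on None to drop it.
records = [
    {name: value for name, value in m.groupdict().items() if value is not None}
    for m in pattern.finditer(log)
]

for record in records:
    print(record)
```

Using `str.format` rather than an f-string keeps this runnable on Python 3.5. If your markers are themselves regular expressions rather than literal strings, drop the `re.escape` calls.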
Upvotes: 2