Reputation: 1040

Use python regex to match all possible groups of 1, 2, 3, and 4 words

How do I use python regex to match all possible sequences of 1, 2, 3, and 4 words in a string? All sequences must be of adjacent words only. So:

str1 = 'AA BB CC DD EE FF GG HH'
matches = re.findall(r'insert ninja regex here', str1)
for match in matches:
    print match

Should output:

AA
AA BB
BB
AA BB CC
BB CC
CC
AA BB CC DD
BB CC DD
CC DD
DD
BB CC DD EE
CC DD EE
DD EE
EE
... etc etc

Thanks

Possible solution with four regexes (let me know if you have a more efficient and faster way of doing this):

matches4 = re.findall(r'(?=((?:\s\S+){3}\s\S+))', str1)
matches3 = re.findall(r'(?=((?:\s\S+){2}\s\S+))', str1)
matches2 = re.findall(r'(?=((?:\s\S+){1}\s\S+))', str1)
matches1 = re.findall(r'(?=(\s\S+))', str1)

THE RESULTS ARE IN:

I ran all 4 answers on a string with 138.2k characters and 22.2k words:
my answer=0.0856201648712s.
zx81 answer option 1=0.0598151683807s.
zx81 answer option 2=0.0905468463898s.
Greg Hewgill answer=0.0292818546295s.

THE WINNER IS GREG! However, zx81 gets the answer check for a regex solution. You all got an up vote.

Upvotes: 1

Answers (3)

user2104778

Reputation: 1040

IS efficient - solution with four regexes (let me know if you have a more efficient and faster way of doing this):

matches4 = re.findall(r'(?=((?:\s\S+){3}\s\S+))', str1)
matches3 = re.findall(r'(?=((?:\s\S+){2}\s\S+))', str1)
matches2 = re.findall(r'(?=((?:\s\S+){1}\s\S+))', str1)
matches1 = re.findall(r'(?=(\s\S+))', str1)

Upvotes: 0

zx81

Reputation: 41838

Overlapping Matches: Lookahead Trickery

It is possible to match all the combinations with a single regex. Since the matches overlap, we will use lookaheads and optional capture groups.

Here are two options.

Option 1: Shorter Regex, Unsorted Ouput

(?=((((\b\S+)(?: \S+)?)(?: \S+)?)(?: \S+)?))

All the combinations are captured to Groups 1, 2, 3 and 4 of the various matches. A few dupes are eliminated when we bring all the groups together into a set. In the regex demo, please look at the group captures in the bottom right pane.

Sample Code

import re
subject = "AA BB CC DD EE FF GG HH"
reobj = re.compile(r"(?=((((\b\S+)(?: \S+)?)(?: \S+)?)(?: \S+)?))")
result = reobj.findall(subject)
tokens = set()
for a in result:
    for b in a:
        tokens.add(b)
print(tokens)

Output

{'CC DD EE', 'EE FF GG HH', 'GG', 'DD EE FF',
'FF', 'DD EE FF GG', 'BB CC DD', 'DD EE', 'FF GG',
'CC', 'FF GG HH', 'HH', 'EE FF GG', 'AA BB', 'CC DD',
'AA BB CC', 'DD', 'GG HH', 'AA', 'BB CC DD EE', 
'EE FF', 'EE', 'AA BB CC DD', 'BB', 'BB CC', 
'CC DD EE FF'}

Option 2: Longer Regex, Sorted Output

(?=\b(\S+(?: \S+){3}))?(?=\b(\S+(?: \S+){2}))?(?=\b(\S+ \S+))?(?=(\b\S+))

When we build the list, a few empty strings need to be removed.

Sample Code

import re
subject = "AA BB CC DD EE FF GG HH"
reobj = re.compile(r"(?=(\b\S+))(?=\b(\S+ \S+))?(?=\b(\S+(?: \S+){2}))?(?=\b(\S+(?: \S+){3}))?")
result = reobj.findall(subject)
tokens = []
for a in result:
    for b in a:
        if b != "":
            tokens.append(b)
print(tokens)

Output

['AA', 'AA BB', 'AA BB CC', 'AA BB CC DD', 
 'BB', 'BB CC', 'BB CC DD', 'BB CC DD EE',
 'CC', 'CC DD', 'CC DD EE', 'CC DD EE FF', 
 'DD', 'DD EE', 'DD EE FF', 'DD EE FF GG',
 'EE', 'EE FF', 'EE FF GG', 'EE FF GG HH', 
 'FF', 'FF GG', 'FF GG HH', 
 'GG', 'GG HH', 
 'HH']

Upvotes: 4

Greg Hewgill

Reputation: 993403

You can do this without using any regexes, and it's probably easier to do this way.

>>> str1 = 'AA BB CC DD EE FF GG HH'
>>> a = str1.split()
>>> [" ".join(a[i:i+n]) for n in range(1, 5) for i in range(len(a)-n+1)]
['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'AA BB', 'BB CC', 'CC DD', 'DD EE', 'EE FF', 'FF GG', 'GG HH', 'AA BB CC', 'BB CC DD', 'CC DD EE', 'DD EE FF', 'EE FF GG', 'FF GG HH', 'AA BB CC DD', 'BB CC DD EE', 'CC DD EE FF', 'DD EE FF GG', 'EE FF GG HH']

In any case, it's certainly easier to read than any regex for this.