kjakeb

Reputation: 7620

How can I find all matches to a regular expression in Python?

When I use the re.search() function to find matches in a block of text, it stops as soon as it finds the first match.

How do I keep searching until ALL matches have been found? Is there a separate function for this?

Upvotes: 553

Views: 584403

Answers (3)

cottontail

Reputation: 23421

Another method (in keeping with the spirit of the OP's original attempt, albeit 13 years later) is to compile the pattern, call search() on the compiled pattern, and advance the start position through the string after each match. This is a bit verbose, but if you don't want to use a lookahead, or you want to search over a string more explicitly, you can use the following function.

import re

def find_all_matches(pattern, string, group=0):
    pat = re.compile(pattern)
    pos = 0
    out = []
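    # The walrus operator (:=) requires Python 3.8+.
    while m := pat.search(string, pos):
        # Resume one character past the start of this match,
        # so overlapping matches are found as well.
        pos = m.start() + 1
        out.append(m[group])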
    return out

pat = r'all (.*?) are'
s = 'all cats are smarter than dogs, all dogs are dumber than cats'
find_all_matches(pat, s)           # ['all cats are', 'all dogs are']
find_all_matches(pat, s, group=1)  # ['cats', 'dogs']

This works for overlapping matches too:

find_all_matches(r'(\w\w)', "hello")  # ['he', 'el', 'll', 'lo']
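
For comparison, here is a minimal sketch of the lookahead alternative mentioned above. It wraps the pattern in a capturing group inside a zero-width lookahead, so each match consumes no characters and the scan moves forward one position at a time. The helper name find_overlapping is hypothetical and only for illustration.

```python
import re

def find_overlapping(pattern, string):
    # Hypothetical helper: (?=(...)) matches at a position without consuming
    # any text, and the outer group captures what the pattern would match there.
    lookahead = re.compile(f'(?=({pattern}))')
    return [m.group(1) for m in lookahead.finditer(string)]

print(find_overlapping(r'\w\w', 'hello'))  # ['he', 'el', 'll', 'lo']
```

Group 1 is always the outer group added by the helper, so this works even if the pattern passed in has capturing groups of its own.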

Upvotes: 11

a3nm

Reputation: 8884

If you are interested in getting all matches (including overlapping matches, unlike @Amber's answer), there is a new library called REmatch which is specifically designed to produce all the matches of a regex on a text, including all overlapping matches. The tool supports a more general language of regular expressions with captures, called REQL.

For instance, the regexp !x{...} will give all runs of three contiguous characters (including overlapping ones).

The approach should be more efficient than @cottontail's answer (which is, in general, quadratic in the length of the input string).

You can try REmatch out online here and get the Python code here.

Disclaimer: I know the authors of the tool. :)

Upvotes: 0

Amber

Reputation: 527368

Use re.findall or re.finditer instead.

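re.findall(pattern, string) returns a list of matching strings. If the pattern contains capturing groups, it returns the captured groups instead of the whole match (as tuples when there is more than one group).

re.finditer(pattern, string) returns an iterator over match objects (re.Match).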

Example:

import re

re.findall(r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')
# Output: ['cats', 'dogs']

[x.group() for x in re.finditer(r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')]
# Output: ['all cats are', 'all dogs are']
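
If you also need to know where each match occurs, the match objects from re.finditer carry that information. The following is a minimal sketch using the same example text; the variable names are only for illustration.

```python
import re

text = 'all cats are smarter than dogs, all dogs are dumber than cats'
pattern = re.compile(r'all (.*?) are')

# For every non-overlapping match, collect its start index, end index,
# the whole matched text, and the first captured group.
results = [(m.start(), m.end(), m.group(0), m.group(1))
           for m in pattern.finditer(text)]

for row in results:
    print(row)
# (0, 12, 'all cats are', 'cats')
# (32, 44, 'all dogs are', 'dogs')
```

Slicing the text with the start and end indexes, as in text[0:12], gives back the matched string.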

Upvotes: 905
