user2871676
user2871676

Reputation:

Python - why is regular expression much more efficient than search in list?

When I had a look at the Google Code Jam Qualification Round 2009 problem: Alien Language, the plain idea is that you generate all possible strings from the pattern, and hold it in a list, then test and count matched strings. This algorithm is simple, straight-forward, but it consumes incredibly large RAM and your laptop undoubtedly die.

Another way to solve this problem is to substitute "()" with "[]" for use of regular expression. Python embedded re.match() and string.replace() took less than a few seconds to pass the large test. Now the question, why is regular expression much more powerful??

In my understanding, there could be certain mechanism like yield function that enables you to generate a "generator" - iterable, and one-pass. But it's my guess.

Upvotes: 2

Views: 193

Answers (1)

Andrew Magee
Andrew Magee

Reputation: 6684

Your intuition is largely correct.

For the computer science details, you can look up the ideas of "nondeterministic finite automaton" and "deterministic finite automaton".

You can think of a regexp compiler as something that takes a regexp and produces a function which operates on your input string and keeps a state, which it updates as it consumes the input string based on rules derived from the regexp.

Hopefully I'm not butchering things too much if I say that conceptually, a regexp of, say, (ab|cd) would produce something that behaves like this:

def match_ab_or_cd(s):
    state = "start"
    for c in s:
        if state == "start":
            if c == "a":
                state = "state_a"
            elif c == "c":
                state = "state_c"
        elif state == "state_a":
            if c == "b":
                return True
            elif c == "a":
                state = "state_a"
            else:
                state = "start"
        elif state == "state_c":
            if c == "d":
                return True
            elif c == "c":
                state = "state_c"
            else:
                state = "start"
    return False


>>> match_ab_or_cd("ab")
True
>>> match_ab_or_cd("cd")
True
>>> match_ab_or_cd("ae")
False
>>> match_ab_or_cd("aaaaab")
True

So for many simple regexes it's possible to produce a machine that only needs to consume each character in the input string once. Bear in mind though that there are regexes that do not play nicely, such as (x+x+)+y.

Oh and here's a fun tool that visualises how regular expressions turn into state machines. You can think of the first picture it produces as an intermediate step, and the second picture is what can be translated into a machine like the above.

Upvotes: 1

Related Questions