Reputation: 1936
If I have some unknown number of regular expressions (zero or more, and hopefully fewer than a few thousand), what is an efficient way to search for one that matches a given string?
What kinds of containers, algorithms and/or data structures should I be using? Is this different if I want to find the single matching regex than if I want a list of all regexes that match? Do those differ from just wanting to know how many matched?
Let me put that another way: let's presume I have a user entering arbitrary strings and I have some container of regexes. I can design the container any way I choose and the search any way I choose. How should I do it if I want a list of all regexes in that collection that match the user input? What if I just wanted to know how many matches exist? What if I just wanted to ensure uniqueness of a match?
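For concreteness, here is the naive baseline I would write today (Java is only for illustration, and the three patterns are made up): test the input against every regex in turn, collect the ones that match, and derive the count and the uniqueness check from that list. I am hoping there is something smarter than this linear scan.
import java.util.*;
import java.util.regex.*;

public class NaiveRegexSearch {
    public static void main(String[] args) {
        List<Pattern> patterns = List.of(          // the container of regexes
                Pattern.compile("foo.*"),
                Pattern.compile("[a-z]+"),
                Pattern.compile("\\d{3}"));
        String input = "foobar";                   // the user's arbitrary string

        List<Pattern> allMatches = new ArrayList<>();
        for (Pattern p : patterns) {               // test every regex in turn
            if (p.matcher(input).matches()) {
                allMatches.add(p);
            }
        }
        int count = allMatches.size();             // "how many matched?"
        boolean unique = count == 1;               // "is the match unique?"

        System.out.println("matched: " + allMatches
                + ", count: " + count + ", unique: " + unique);
    }
}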
Upvotes: 3
Views: 182
Reputation: 59184
If you can do some precomputation on your regular expressions before you try to match strings against them, then you can convert the union of all of them into a DFA which can match a string against all of them at the same time.
See: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
This approach is very often used for lexical analysis (tokenization) in parsers and compilers. The benefit of a DFA is that it's the same speed (fast) no matter how many regexes you put into it or how complicated they are.
It's not so easy, but there are tools around. If you're working in Java, then I have an open source project that you might be able to use: http://mtimmerm.github.io/dfalex/ . To answer your other questions, you can get the set of all matching regexes out of this if you want.
If you're interested in how to do it yourself, the process generally consists of converting your regular expressions into an NFA (https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton) using Thompson's construction (https://en.wikipedia.org/wiki/Thompson%27s_construction), then converting the NFA into a DFA using subset construction (https://en.wikipedia.org/wiki/Powerset_construction), and then usually minimizing the DFA with Hopcroft's algorithm (https://en.wikipedia.org/wiki/DFA_minimization).
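For illustration only, here is a tiny Java sketch (not the dfalex library, and with the two pattern fragments hand-built rather than produced by Thompson's construction) of the set-of-states idea that subset construction is based on: each accepting NFA state is tagged with the regex it belongs to, and one pass over the input advances all live states at once, so the set of matching regexes falls out at the end. A real implementation would build the NFAs from the regex syntax, handle epsilon edges, and memoize each state set as an explicit DFA state.
import java.util.*;

public class UnionMatcherSketch {
    // transitions.get(state).get(ch) = set of next NFA states
    static Map<Integer, Map<Character, Set<Integer>>> transitions = new HashMap<>();
    // accepting.get(state) = index of the regex that this state accepts
    static Map<Integer, Integer> accepting = new HashMap<>();

    static void addEdge(int from, char ch, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(ch, k -> new HashSet<>()).add(to);
    }

    public static void main(String[] args) {
        // Regex 0: "ab"  -> states 0 -a-> 1 -b-> 2 (accept as regex 0)
        addEdge(0, 'a', 1);
        addEdge(1, 'b', 2);
        accepting.put(2, 0);
        // Regex 1: "a+"  -> states 0 -a-> 3, 3 -a-> 3 (accept as regex 1)
        addEdge(0, 'a', 3);
        addEdge(3, 'a', 3);
        accepting.put(3, 1);

        System.out.println(matches("ab"));  // [0]
        System.out.println(matches("aaa")); // [1]
        System.out.println(matches("b"));   // []
    }

    // Returns the indices of all regexes whose NFA accepts the whole input.
    static Set<Integer> matches(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);                           // shared start state
        for (char ch : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                Map<Character, Set<Integer>> edges = transitions.get(state);
                if (edges != null && edges.containsKey(ch)) {
                    next.addAll(edges.get(ch));
                }
            }
            current = next;                       // one step for all regexes at once
        }
        Set<Integer> matched = new TreeSet<>();
        for (int state : current) {
            if (accepting.containsKey(state)) {
                matched.add(accepting.get(state));
            }
        }
        return matched;
    }
}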
There's lots of room for optimization and finesse.
Good Luck!
P.S. I should note a couple things: 1) You can't generally make DFAs out of regexes that have back-references. 2) It's theoretically possible for the DFA to be exponentially big. This almost never happens by accident, but if your regexes are entered by potentially malicious people, then you would have to do something about that possibility.
Upvotes: 2
Reputation: 7880
A PHP example:
<?php
// Count, for each regex, how many of the input strings it matches.
$regex_array = array(
    "/regex_1/" => 0,
    "/regex_2/" => 0,
    "/regex_3/" => 0 // and so on and so forth
);
$strings_array = array(
    "input_string_1",
    "input_string_2",
    "input_string_3" // and so on and so forth
);
foreach ($regex_array as $key => $value) {
    foreach ($strings_array as $current_string) {
        if (preg_match($key, $current_string)) {
            $regex_array[$key]++;
        }
    }
}
?>
Here's the running code.
Upvotes: 0
Reputation: 1936
I will not mark my own answer as the answer unless no one beats it in a few days.
So far the only idea I have had that is worth anything is to put each regex into one of two piles inside the container when it is added.
Into one pile goes every regex with a wildcard, character class or anything else that makes it deviate from a conventional string. I will call this the RegexPile.
Into the other pile go all the regexes that are plain strings or trivially convertible to strings. Because strings are easy to match and the algorithms are well understood, this pile can be an ordered, sorted container, and finding strings in it is trivial with a binary search. I will call this the SortedStringArray.
Naively, I can linearly search the RegexPile and do a binary search on the SortedStringArray. This at least allows me to skip some comparisons and costs little in terms of time or space, but it also doesn't offer much real optimization.
It is computationally similar, but if I do something like this I think I will launch a thread for each regex (or each small group of regexes) in the RegexPile. My thinking is that any given regex can take an unbounded amount of time, because regexes can do that, so if any thread takes too long I can fail based on a timeout and prematurely terminate all the threads. I also think that the majority will fail on the first character, meaning that most of those threads will simply disappear once the first character is checked. With the cheap copy-on-write threads most systems provide nowadays, this thread spawning ought to be cheap enough that many threads will close before I finish spawning them all, and only the regexes fairly similar to the input will linger for any time. Then I do the binary search on the SortedStringArray in another thread.
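Here is a minimal sketch of the two-pile container (Java chosen just for illustration, since I haven't tied the question to a language; the names mirror the prose above, and the per-regex thread/timeout refinement is left out). Literals go into a sorted set so lookup is effectively a binary search; everything else goes into a list that is scanned linearly.
import java.util.*;
import java.util.regex.*;

public class TwoPileMatcher {
    private final NavigableSet<String> sortedStringArray = new TreeSet<>(); // literal pile
    private final List<Pattern> regexPile = new ArrayList<>();              // everything else

    public void add(String regex) {
        // Crude test for "trivially a plain string": no regex metacharacters.
        if (regex.matches("[A-Za-z0-9_ ]*")) {
            sortedStringArray.add(regex);
        } else {
            regexPile.add(Pattern.compile(regex));
        }
    }

    // Returns the source text of every pattern that matches the whole input.
    public List<String> matchesFor(String input) {
        List<String> result = new ArrayList<>();
        if (sortedStringArray.contains(input)) {   // O(log n) lookup in the literal pile
            result.add(input);
        }
        for (Pattern p : regexPile) {              // linear scan of the regex pile
            if (p.matcher(input).matches()) {
                result.add(p.pattern());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TwoPileMatcher m = new TwoPileMatcher();
        m.add("hello");         // literal pile
        m.add("h[aeiou]llo");   // regex pile
        m.add("\\d+");          // regex pile
        System.out.println(m.matchesFor("hello")); // [hello, h[aeiou]llo]
        System.out.println(m.matchesFor("42"));    // [\d+]
    }
}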
Upvotes: 0