lquerel

Reputation: 183

Efficient algorithm for string matching with a very large pattern set

I'm looking for an efficient algorithm able to find all patterns that match a specific string. The pattern set can be very large (more than 100,000 patterns) and dynamic (patterns can be added or removed at any time). Patterns are not necessarily standard regexes; they can be a subset of regex or something similar to shell patterns (e.g. file-*.txt). A solution for a subset of regex is preferred (as explained below).

FYI: I'm not interested in brute-force approaches based on a list of regexes.

By simple regexp, I mean a regular expression that supports ?, *, +, character classes such as [a-z], and possibly the alternation operator |.

To clarify my need: I wish to find all patterns that match the URL:

http://site1.com/12345/topic/news/index.html

Based on the pattern set below, the response should be these patterns:

http://*.site1.com/*/topic/*
http://*.site1.com/* 
http://*

Pattern set:

http://*.site1.com/*/topic/*
http://*.site1.com/*/article/*
http://*.site1.com/* 
http://*.site2.com/topic/*
http://*.site2.com/article/*
http://*.site2.com/* 
http://*
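
For reference, the brute-force baseline I want to avoid could look like the sketch below (useful as a correctness check, not as the answer). It only handles the * and ? used above, and it treats "*." right after the scheme as also matching the bare host, which is what the expected result implies; all names are illustrative.

import re

def pattern_to_regex(pattern):
    """Naive glob-to-regex translation for the pattern syntax used above."""
    regex = re.escape(pattern)
    regex = regex.replace(r"\*\.", r"(?:.*\.)?")   # "*." may also match no subdomain at all
    regex = regex.replace(r"\*", ".*")             # remaining "*" matches any run of characters
    regex = regex.replace(r"\?", ".")              # "?" matches any single character
    return re.compile("^" + regex + "$")

patterns = [
    "http://*.site1.com/*/topic/*",
    "http://*.site1.com/*/article/*",
    "http://*.site1.com/*",
    "http://*.site2.com/topic/*",
    "http://*.site2.com/article/*",
    "http://*.site2.com/*",
    "http://*",
]

url = "http://site1.com/12345/topic/news/index.html"
print([p for p in patterns if pattern_to_regex(p).match(url)])
# -> the three matching patterns listed above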

Upvotes: 6

Views: 3492

Answers (3)

Zeke Medley

Reputation: 371

Here is an approach we've used pretty successfully (implementation here):

Adding Patterns:

For any pattern, there is a set of substrings that a string must contain in order to have a chance of matching it. Call these meta words. For example:

dog*fish -> [dog, fish]
[lfd]og  -> [og]
dog?     -> [dog]

When you add a pattern to the data structure, break it up into meta words and store them in an Aho-Corasick string-matching data structure. Maintain an internal data structure that maps meta words back to their patterns.
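
Here is a minimal sketch of the indexing step using plain Python structures: a real implementation inserts the meta words into an Aho-Corasick automaton, but a dict keyed by meta word is enough to show the shape of the data. The extraction below only understands *, ? and [...] classes.

import re
from collections import defaultdict

def meta_words(pattern):
    """Literal runs a string must contain to have any chance of matching `pattern`."""
    stripped = re.sub(r"\[[^\]]*\]", "*", pattern)    # a character class breaks a literal run, like '*'
    return [w for w in re.split(r"[*?]+", stripped) if w]

# Reverse map: meta word -> patterns that produced it. (In the real thing the
# meta words also go into the Aho-Corasick automaton.)
index = defaultdict(set)

def add_pattern(pattern):
    for word in meta_words(pattern):
        index[word].add(pattern)

for p in ["dog*fish", "[lfd]og", "dog?"]:
    add_pattern(p)

print(meta_words("dog*fish"))  # ['dog', 'fish']
print(meta_words("[lfd]og"))   # ['og']
print(meta_words("dog?"))      # ['dog']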

Running Queries:

Given an input string, use the Aho-Corasick data structure you've built to get all the meta words contained in that string. Then, using the map you've created, test the patterns that correspond to those meta words.
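
A matching sketch of the query step. A naive substring scan stands in for the Aho-Corasick pass, and fnmatch stands in for whatever matcher you actually use for the full pattern test; the index literal mirrors what add_pattern above would build for the three example patterns.

from fnmatch import fnmatchcase

index = {
    "dog":  {"dog*fish", "dog?"},
    "fish": {"dog*fish"},
    "og":   {"[lfd]og"},
}

def query(text):
    # 1. Find the meta words contained in the input (Aho-Corasick does this in
    #    one pass; a substring scan is enough for the sketch).
    candidates = set()
    for word, patterns in index.items():
        if word in text:
            candidates |= patterns
    # 2. Run the full (slow) pattern match only against those candidates.
    return {p for p in candidates if fnmatchcase(text, p)}

print(query("dogfish"))  # {'dog*fish'}
print(query("fog"))      # {'[lfd]og'}
print(query("dogs"))     # {'dog?'}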

This works well because, while full pattern matching is fairly slow, you can narrow down the number of patterns you actually have to match against very quickly. Our implementation can perform about 200,000 queries per second, on a regular laptop, against sets of 150,000+ patterns. See the benchmarking mode in the program to test that.

Upvotes: 3

John Sobolewski

Reputation: 4572

One approach that comes to mind is to create tree structures of patterns.

Example: http://* would contain all the patterns (listed above). http://*.site1.com/* would contain all the site1.com ones. This could significantly reduce the number of patterns that need to be checked.

Additionally, you could determine which patterns are mutually exclusive to further prune the list you search.

So first take all the patterns and create trees out of them. Search all roots to determine which branches and nodes need to be analyzed.

Improve the algorithm by determining which branches are mutually exclusive so once you find a hit on a given branch you would know which branches/nodes do not need to be visited.

To get started you could be lazy: a first pass could sort the patterns and apply simple "does the next pattern contain this pattern?" logic to determine whether one is contained in the other. EX: if( "http://*.site1.com/*".startsWith("http://*") == true )

You could get more sophisticated in your ability to determine if one pattern does actually contain another but this would get you started.
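
A rough sketch of that lazy first pass (Python here just for brevity): the containment test is the same crude prefix check as the startsWith example, dropping a trailing *, so it is a heuristic rather than a real proof that one pattern subsumes another.

patterns = [
    "http://*",
    "http://*.site1.com/*",
    "http://*.site1.com/*/topic/*",
    "http://*.site1.com/*/article/*",
    "http://*.site2.com/*",
    "http://*.site2.com/topic/*",
]

def may_contain(parent, child):
    prefix = parent[:-1] if parent.endswith("*") else parent
    return child != parent and child.startswith(prefix)

# Hang every pattern off the longest pattern that (crudely) contains it.
children = {p: [] for p in patterns}
roots = []
for child in patterns:
    parents = [p for p in patterns if may_contain(p, child)]
    if parents:
        children[max(parents, key=len)].append(child)
    else:
        roots.append(child)

print(roots)                              # ['http://*']
print(children["http://*.site1.com/*"])  # the two site1 sub-patterns

A query would then start at the roots and only descend into branches whose patterns still match.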

To get any better at answering the question:

"Does this pattern contain that pattern?"

I believe you would need to be able to parse the regex... This article looks like a good place to start to understand how to accomplish that: Parsing regular expressions with recursive descent

Upvotes: 2

Ira Baxter

Reputation: 95334

If the set of URL patterns doesn't change very fast, you really should use a regex engine that compiles down its patterns. Java provides one of these, but it might not be satisfactory if you want to know which pattern matches.
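
One common workaround for the "which pattern matched" problem is to combine all the patterns into a single alternation with one named group per pattern, so a single compiled matcher also reports which alternative fired. A sketch (Python's re for brevity; java.util.regex named groups work the same way), with the caveat that one match only reports the first alternative that matched, not every pattern that would match:

import re

globs = ["http://*.site1.com/*/topic/*", "http://*.site1.com/*", "http://*"]

def glob_fragment(pattern):
    # '*' -> '.*', everything else taken literally
    return "".join(".*" if c == "*" else re.escape(c) for c in pattern)

# One big alternation, one named group per pattern.
combined = re.compile("|".join(
    f"(?P<p{i}>{glob_fragment(g)})" for i, g in enumerate(globs)))

m = combined.fullmatch("http://www.site1.com/12345/topic/news/index.html")
if m:
    print(globs[int(m.lastgroup[1:])])  # -> http://*.site1.com/*/topic/*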

A widely used mechanism for doing this, and for determining which pattern matched, is a lexer generator, e.g., FLEX and similar tools. They accept what amounts to a regex for each "lexeme" and build an integrated FSA that recognizes any of them, which is extremely efficient to execute.

You could invoke Flex whenever your set changes. If that's too slow, get an open-source version of Flex and integrate it into your engine; it builds the FSA internally, so you could use it directly. (Some engineering will likely be necessary.) But if you really have a high-performance matching problem, some work to do it well won't bother you.

If the set of URL patterns changes faster than FLEX can generate FSAs (which would be odd), you have a real problem. In that case, you might build an online discrimination tree by scanning the "regex" from left to right and integrating the characters/predicates you see into your existing discrimination tree. Matching then consists of walking down the discrimination tree, performing the various tests; if you arrive at a leaf, you have a match, otherwise not. This might be just as fast as a FLEX-generated automaton if done right, but likely a lot, lot bigger.
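
A minimal sketch of that discrimination-tree idea, restricted to glob patterns with * only: patterns share a prefix trie whose edges are literal characters or the wildcard, and matching walks the trie, collecting every pattern whose end node is reached. (The sample URL uses a www. subdomain so the literal dot in *.site1.com is present.)

class Node:
    def __init__(self):
        self.edges = {}      # character (or '*') -> child Node
        self.patterns = []   # patterns that end exactly at this node

def insert(root, pattern):
    node = root
    for ch in pattern:
        node = node.edges.setdefault(ch, Node())
    node.patterns.append(pattern)

def match(node, text, i=0):
    found = set()
    if i == len(text):
        found.update(node.patterns)
    elif text[i] in node.edges:               # follow a literal edge
        found |= match(node.edges[text[i]], text, i + 1)
    if "*" in node.edges:                     # '*' may consume 0..n characters
        star = node.edges["*"]
        for j in range(i, len(text) + 1):
            found |= match(star, text, j)
    return found

root = Node()
for p in ["http://*.site1.com/*/topic/*", "http://*.site1.com/*", "http://*"]:
    insert(root, p)

print(match(root, "http://www.site1.com/12345/topic/news/index.html"))
# -> all three patterns above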

Upvotes: 0
