Reputation: 430
I have an array of regexes. I have a large number of files I'd like to flag if any of the regexes match. Right now I just search each file with each regex.
It occurred to me there might be a way to e.g. build a tree that uses some fast pre-processing on the file to determine whether or not to search it with a particular regex. For example, all regexes that contain the letter A are on a particular branch and if the file doesn't contain the letter A then those regexes aren't applied.
Has anyone done any work on this? I'm forced to process the files using pure PHP and I have to walk the directory tree to process each file one by one. I can control the data structure the regexes are in and how they're used, but I need the flexibility of regex to do the final pattern matching.
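The pre-filter idea above could be sketched roughly as follows. This is written in Python for brevity; in PHP the same shape would use `strpos` for the cheap check and `preg_match` for the final match. The `RULES` table and the `flag` function are hypothetical names, and the sketch assumes each regex is paired by hand with a literal that every match must contain (a case-insensitive regex would need a case-insensitive literal check instead).

```python
import re

# Hypothetical rule table: (required literal, compiled regex).
# Assumption: every possible match of the regex contains the literal.
RULES = [
    ("A", re.compile(r"A[0-9]+")),
    ("error", re.compile(r"error\s+code\s+\d+")),
]

def flag(text, rules=RULES):
    """Return True if any regex matches, skipping regexes whose literal is absent."""
    for literal, regex in rules:
        # Cheap substring test first; it avoids the regex engine entirely
        # when the file cannot possibly match.
        if literal not in text:
            continue
        if regex.search(text):
            return True
    return False
```

The saving depends on how selective the literals are: a literal that appears in almost every file filters out nothing, so longer or rarer literals are the better choice.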
Upvotes: 1
Views: 145
Reputation: 12592
You could try the Aho-Corasick algorithm if you can reduce the regexes to literal words, for example patterns that use only a wildcard. Aho-Corasick with a wildcard is quite simple: split the pattern at the wildcard and add the resulting pieces to the automaton. When searching, you can use the automaton states and the input position to compute the longest matching prefix.
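A minimal sketch of that approach, in Python and under some assumptions: the wildcard is a single `*` meaning "any run of characters", the pattern is unanchored, and the function names (`build_automaton`, `find_all`, `wildcard_search`) are made up for illustration. It splits the pattern at the wildcard, finds every piece in one pass over the text, then checks that the pieces occur in order.

```python
from collections import deque

def build_automaton(words):
    """Build an Aho-Corasick trie with failure links for the given words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # words recognised on reaching each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, word) for every occurrence of every word."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

def wildcard_search(pattern, text):
    """True if the pieces between '*' wildcards occur in order in text."""
    pieces = [p for p in pattern.split("*") if p]
    starts = {}
    for start, word in find_all(text, build_automaton(set(pieces))):
        starts.setdefault(word, []).append(start)
    pos = 0  # earliest position the next piece may start at
    for piece in pieces:
        candidates = [s for s in starts.get(piece, []) if s >= pos]
        if not candidates:
            return False
        pos = min(candidates) + len(piece)
    return True
```

For many patterns, the pieces of all of them can go into one shared automaton, so each file is scanned once and only the patterns whose pieces were all seen need the full regex check afterwards.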
Upvotes: 1