Reputation: 5826
I have some large files (hundreds of MB) that I need to search for several thousand ~20-character unique strings.
I've found that combining the strings into a single regular expression with the pipe alternation metacharacter, like (string1|string2|string3),
speeds up the search a lot (versus searching for one string at a time).
What's the limit to how well this will scale? How many expressions can I chain together like this? Will it cause some kind of overflow at some point? Is there a better way to do this?
EDIT
In an effort to keep my question brief, I didn't emphasize the fact that I've already implemented code using this alternation approach and I found it to be helpful: On a test case with a typical data set, running time was reduced from 87 minutes down to 18 seconds--a 290x speedup, apparently with O(n) instead of O(n*m).
My question relates to how this approach can be expected to work when other users run this code in the future on much larger data sets, with bigger files and more search terms. The original O(n*m) code had been in use for 13 years, and its slowness was only pointed out recently, as the genome-related data sets it operates on have grown much larger.
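For reference, the alternation approach described in the question can be sketched like this (the terms and input lines are hypothetical stand-ins for the real genome data):

```perl
use strict;
use warnings;

# Hypothetical ~20-character search terms.
my @terms = ('ACGTACGTACGTACGTACGT', 'TTTTCCCCGGGGAAAATTTT');

# quotemeta escapes any regex metacharacters in the literals,
# and qr// compiles the combined pattern exactly once.
my $pattern = join '|', map { quotemeta } @terms;
my $re      = qr/(?:$pattern)/;

# A single pass over each line finds any of the terms.
my @lines = ('xxACGTACGTACGTACGTACGTxx', 'no terms here');
my @hits  = grep { $_ =~ $re } @lines;
print scalar(@hits), " line(s) matched\n";   # prints "1 line(s) matched"
```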
Upvotes: 12
Views: 12680
Reputation: 126742
There is no theoretical limit to the extent of a regular expression, but practically it must fit within the limits of a specific platform and installation. You must find out empirically whether your plan will work, and I for one would be delighted to see your results.
One thing I would say is that you should compile the expression separately before you go on to use it. Either that, or apply the /o
option so that it is compiled just once (i.e. a promise that the contents of the expression won't change). Something like this
use strict;
use warnings;
use IO::File;

# quotemeta protects any regex metacharacters in the search strings
my $re = join '|', map { quotemeta } @strings;

foreach my $file (@files) {
    my $fh = IO::File->new($file, '<') or die "Can't open $file: $!";
    while (<$fh>) {
        next unless /\b(?:$re)\b/io;
        chomp;
        print "$_ found in $file\n";
        last;
    }
}
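As an aside, in modern Perl the qr// operator is a cleaner alternative to /o: it compiles the pattern once into a regex object up front, so no promise about the pattern's contents is needed. A minimal sketch (the term list is hypothetical):

```perl
use strict;
use warnings;

my @strings = ('foo', 'bar', 'baz');   # hypothetical search terms

# qr// compiles the pattern once, up front; no /o needed,
# and quotemeta protects any metacharacters in the terms.
my $re = do {
    my $alt = join '|', map { quotemeta } @strings;
    qr/\b(?:$alt)\b/i;
};

print "hit\n" if 'a line with BAR in it' =~ $re;   # prints "hit"
```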
Upvotes: 2
Reputation: 60017
If you are just going to have a regular expression of the form (word1|word2|....|wordn), why not just create an associative array (hash) of booleans? Lookups in it should be very fast.
EDIT
# before the loop, set up the hash
my %words = (
    cat   => 1,
    dog   => 1,
    apple => 1,
    # .... etc
);

# in the loop, check each word of a sentence
foreach my $aword (split / /, $sentence) {
    print "Found $aword\n" if $words{$aword};
}
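Writing the hash out by hand doesn't scale to thousands of terms; a hash slice builds the same lookup table from a list. A minimal runnable sketch (terms and sentence are hypothetical):

```perl
use strict;
use warnings;

# With thousands of terms, build the hash from a list
# rather than writing the literals out.
my @strings = ('cat', 'dog', 'apple');
my %words;
@words{@strings} = (1) x @strings;    # hash slice: every term maps to 1

my $sentence = 'the cat sat on the apple';
my @found = grep { $words{$_} } split / /, $sentence;
print "@found\n";    # prints "cat apple"
```

Note that this only works when the targets are whole, delimiter-separated words; the alternation approach also finds terms embedded inside longer strings.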
Upvotes: 3
Reputation: 7392
If you have a simple regular expression like (word1|word2|...|wordn), the regex engine will construct a state machine that can just pass over the input once to find whether the string matches.
Side note: in theoretical computer science, "regular expressions" are defined in a way such that a single pass is always sufficient. However, practical regex implementations add features (such as backreferences) that allow construction of patterns which can't always be implemented as a single pass (see this example).
Again, for a pattern of this form, the engine will almost certainly use a single pass. That is likely to be faster than reading the data from memory multiple times, and almost certainly a lot faster than reading it multiple times from disk.
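A short sketch of the single-pass behaviour: one combined pattern scans each line once, and a capture group even reports which term matched (the terms here are hypothetical):

```perl
use strict;
use warnings;

# One combined pattern means one pass per line; the capture
# group tells us which of the terms was found.
my @terms = ('needle', 'pin', 'tack');
my $alt   = join '|', map { quotemeta } @terms;
my $re    = qr/\b($alt)\b/;

my $line = 'a haystack with a pin in it';
if ($line =~ $re) {
    print "matched term: $1\n";   # prints "matched term: pin"
}
```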
Upvotes: 6