HyderA

Reputation: 21401

Large regex patterns: PCRE won't do it

I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.

PCRE throws an error saying preg_match_all: Compilation failed: regular expression is too large at offset 704416

Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.

Upvotes: 0

Views: 894

Answers (4)

mrcrowl

Reputation: 1476

Could you approach the problem from the other direction?

  1. Use regex to clean up your 500K of HTML and pull out all the words into a big-ass array. Something like \b(\w+)\b (sorry, haven't tested that).

  2. Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.

  3. Loop through each word from (1), lowercase it, and then look it up in your hash table.

  4. Increment the item in your hash table when it matches.
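
The four steps above could be sketched roughly as follows. This is a minimal, untested-in-production sketch: the function name countKeywordMatches and the sample input are made up for illustration, and it assumes plain ASCII words and a PHP version with short array syntax (5.4+).

```php
<?php
// Hypothetical helper: count how often each keyword occurs in $content.
function countKeywordMatches($content, array $keywords)
{
    // Step 2: hash table keyed by lowercased keyword, every count starting at 0.
    $counts = array_fill_keys(array_map('strtolower', $keywords), 0);

    // Step 1: pull out every word. \w+ is greedy, so explicit \b anchors are optional.
    preg_match_all('/\w+/', $content, $matches);

    // Steps 3 and 4: lowercase each word and bump its counter on a hit.
    foreach ($matches[0] as $word) {
        $word = strtolower($word);
        if (isset($counts[$word])) {
            $counts[$word]++;
        }
    }

    return $counts;
}

$counts = countKeywordMatches('<p>Foo bar, foo BAZ.</p>', ['foo', 'baz', 'qux']);
```

The pattern here is tiny no matter how many keywords there are, so the "regular expression is too large" limit never comes into play; the 500 words live in the array, not in the regex.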

Upvotes: 2

xzyfer

Reputation: 14135

You can use str_word_count or explode the string on whitespace (or whatever delimiter makes sense for the context of your document), then filter the results against your keywords.

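// Format 1 makes str_word_count() return an array of the words it found.
$allWordsArray = str_word_count($content, 1);
// Keep only the words that appear in the keyword list.
$matchedWords = array_filter($allWordsArray, function ($word) use ($keywordsArray) {
    return in_array($word, $keywordsArray);
});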

This assumes PHP 5.3+ for the closure; in earlier versions of PHP you can substitute create_function.
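
With around 500 keywords, in_array scans the keyword list once for every word in the document. A possible variation, sketched here with made-up sample data, is to flip the keyword list once so that each check becomes a key lookup. The $keywordLookup name is only illustrative, and the match stays case-sensitive, as in the snippet above.

```php
<?php
// Sample inputs; $content and $keywordsArray are the names used above.
$content = 'The quick brown fox jumps over the lazy dog';
$keywordsArray = array('fox', 'dog', 'cat');

// Flip the keyword list so each keyword becomes an array key.
$keywordLookup = array_flip($keywordsArray);

$allWordsArray = str_word_count($content, 1);
$matchedWords = array_filter($allWordsArray, function ($word) use ($keywordLookup) {
    // isset() on a key is a hash lookup, not a scan of the whole list.
    return isset($keywordLookup[$word]);
});

// array_filter() preserves the original keys, so reindex for a clean list.
$matchedWords = array_values($matchedWords);
```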

Upvotes: 0

Amber

Reputation: 527213

Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?
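
One way this might look in PHP is to walk the string with strtok, which returns one token per call. This is only a sketch: the findKeywords name, the delimiter set, and the sample input are assumptions, and the right delimiters depend on what the document actually contains.

```php
<?php
// Hypothetical helper: return the keywords that occur at least once in $content.
function findKeywords($content, array $keywords)
{
    $wanted = array_flip($keywords); // keyword => index, for fast lookups
    $found = array();

    // Assumed delimiters: whitespace plus common punctuation.
    $delimiters = " \t\r\n.,;:!?()\"";

    // strtok() hands back one token per call, so no full token array is built.
    for ($token = strtok($content, $delimiters); $token !== false; $token = strtok($delimiters)) {
        if (isset($wanted[$token])) {
            $found[$token] = true;
        }
    }

    return array_keys($found);
}

$found = findKeywords("apples, pears and plums.\nMore pears!", array('pears', 'plums', 'kiwis'));
```

Because tokens are checked as they are produced, memory use stays close to the size of the input string plus the keyword table.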

Upvotes: 3

Wolph

Reputation: 80081

You can try re2.

One of its strengths is that it uses automata theory to guarantee that the regex runs in time linear in the size of its input.

Upvotes: 0
