Reputation: 21401
I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.
PCRE throws an error saying preg_match_all: Compilation failed: regular expression is too large at offset 704416
Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.
Upvotes: 0
Views: 894
Reputation: 1476
Could you approach the problem from the other direction?
Use regex to clean up your 500K of HTML and pull out all the words into a big-ass array. Something like \b(\w+)\b.. (sorry haven't tested that).
Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.
Loop through each word from (1), lowercase it, and then match it against your hashtable.
Increment the item in your hash table when it matches.
Upvotes: 2
Reputation: 14135
You can use str_word_count or explode the string on whitespace (or whatever dilimeter makes sense for the context of your document) then filter the results against you keywords.
$allWordsArray = str_word_count($content, 1);
$matchedWords = array_filter($allWordsArray, function($word) use ($keywordsArray) {
return in_array($word, $keywordsArray);
});
This assume php5+ to use the closure, but this can be substituted for create_function in earlier versions of php.
Upvotes: 0
Reputation: 527213
Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?
Upvotes: 3