Discard all characters but the first 10 words before and after a search term

Question

I'm trying to finish the search function on one of the sites I'm developing. Since my search results display only excerpts of the matched items' contents, I want to highlight the search terms within the results and show only the portions of text that actually contain those terms.

What I figured I'd do is to fetch the whole content from the database and use preg_replace to insert elements around the search terms and at the same time extract only the first 10 words before and after the term. So this is the regex part of it:

(?:.*?)((?:\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

Basically, I try to "discard" all text except the first 10 words before the search term by using a non-capturing subpattern, then get the 10 words before the term, then the term itself, then the next 10 words.

This is the replacement text in preg_replace:

\1\2\3...

The search itself is run through MySQL's MATCH()...AGAINST() on MyISAM FULLTEXT indexes spanning multiple columns. However, the regex above is applied to only one of those columns (let's call it content).

So my problem is that whenever I get a match on other columns but not on the content column, the regex above strips all text from the content column. I believe that's because the (?:.*?) subpattern at the very beginning keeps matching without ever finding the subpatterns that follow it.

I was wondering if there is any other way to achieve the original purpose of the regex without this side effect. I am currently thinking of simply using preg_match_all to match the search term and the 10 words before and after it, then iterating over all of the matches and building the preview text manually. That would be a sound solution, but given my inexperience with regex, I thought I might as well try to fix the regex itself.
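
The preg_match_all approach described above could look roughly like the sketch below. It is only an illustration: the function name buildExcerpt, the $radius parameter, and the <strong> wrapper are assumptions rather than part of the original code, and the content is assumed to be plain text that is safe to output.

```php
<?php
// Hypothetical helper: collect each term hit together with up to $radius
// words of context on either side, and join the pieces with ' ... '.
function buildExcerpt(string $content, array $terms, int $radius = 10): string
{
    // Quote each term so regex metacharacters in it are matched literally.
    $alternation = implode('|', array_map(function ($term) {
        return preg_quote($term, '@');
    }, $terms));

    // No leading (?:.*?) here: preg_match_all scans forward on its own.
    $pattern = '@\b((?:\w+\W+){0,' . $radius . '})'
             . '(' . $alternation . ')\b'
             . '((?:\W+\w+){0,' . $radius . '})@i';

    // 0 means no hit, false means a PCRE error; keep the text in both cases.
    if (!preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
        return $content;
    }

    $parts = [];
    foreach ($matches as $m) {
        // $m[1] = words before, $m[2] = the term, $m[3] = words after.
        $parts[] = $m[1] . '<strong>' . $m[2] . '</strong>' . $m[3];
    }
    return implode(' ... ', $parts) . ' ...';
}
```

Because the pattern is not used with preg_replace, content without any hit is returned unchanged instead of being stripped. One limitation of this sketch: a second term that falls inside the context of an earlier hit is consumed as context and is not highlighted separately.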

UPDATE

I just noticed that I only get blank contents when I enter two or more search terms. With a single term it works perfectly. I now have no idea why this is happening.

UPDATE 2

Echoing preg_last_error() gives PREG_BACKTRACK_LIMIT_ERROR. The search terms I used were new and post.

A var_dump of the regex and the terms shows this:

@(?:.*?)((?:\w+\W+){0,10})(new|post)((?:\W*\w+\W+){0,10})@i

array
  0 => string 'new' (length=3)
  1 => string 'post' (length=4)
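
A likely explanation for the blank output is that preg_replace returns null when PCRE reports an error such as the backtrack limit, and echoing null prints nothing. A minimal sketch of a guard follows; the helper name safeReplace is hypothetical.

```php
<?php
// Hypothetical wrapper: run preg_replace and fall back to the original
// text when PCRE reports an error instead of returning a result.
function safeReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        // preg_last_error() holds the reason at this point,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        return $subject;
    }
    return $result;
}
```

This only hides the symptom: the excerpt is not shortened when the error occurs, so the pattern itself still needs fixing.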

UPDATE 3

I used Regex Coach to step through the match. It seems the engine backtracks too much after it finds no match for (new|post). The target text is simply a random three-paragraph lorem ipsum. I think I need to find a better regex for this task.

UPDATE 4

Using a once-only subpattern solves the problem. I don't understand the details yet, but I re-read the PHP Manual and found a passage saying that once-only subpatterns help with excessive backtracking. This is the new regex:

(?:.*?)((?>\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

But I'm still open to suggestions for better regexes. Thanks!

Braiba · Accepted Answer

If you're having issues with hitting the backtracking limit, you generally want to look at once-only subpatterns.

In this case, however, your main issue seems to be the (?:.*?) being followed by (?:\w+\W+){0,10}. Take for example the string 'hello world!', ignoring the {0,10} for now. The two subpatterns can split that string in all of the following ways:

'' and 'hello '
'h' and 'ello '
'he' and 'llo '
'hel' and 'lo '
'hell' and 'o '
'hello ' and 'world!'
'hello w' and 'orld!'
'hello wo' and 'rld!'
'hello wor' and 'ld!'
'hello worl' and 'd!'

The easiest way to block this redundant backtracking is to add a word boundary check (\b) after the (?:.*?) subpattern. This reduces the potential matches to:

'' and 'hello '
'hello ' and 'world!'
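
The two lists above can be reproduced with a short sketch. The helper handoverPrefixes is hypothetical and exists only to enumerate the split points; the last line shows the question's pattern with \b added, with the terms hard-coded for illustration.

```php
<?php
// Hypothetical helper: list every prefix the lazy (?:.*?) could consume
// such that \w+\W+ is able to match at the offset right after it.
function handoverPrefixes(string $subject, bool $requireBoundary): array
{
    // \G pins the match to the offset passed to preg_match().
    $pattern = $requireBoundary ? '/\G\b\w+\W+/' : '/\G\w+\W+/';

    $prefixes = [];
    for ($i = 0; $i < strlen($subject); $i++) {
        if (preg_match($pattern, $subject, $m, 0, $i) === 1) {
            $prefixes[] = substr($subject, 0, $i);
        }
    }
    return $prefixes;
}

$without = handoverPrefixes('hello world!', false); // 10 prefixes: '', 'h', 'he', ...
$with    = handoverPrefixes('hello world!', true);  // 2 prefixes: '' and 'hello '

// The question's pattern with \b inserted after the lazy prefix.
$pattern = '@(?:.*?)\b((?:\w+\W+){0,10})(new|post)((?:\W*\w+\W+){0,10})@i';
```

Fewer split points means fewer positions from which the engine retries the rest of the pattern when the search term is not found.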

EDIT: Here is an example of why a once-only subpattern will not work here:

preg_replace('/(?>[a-z]{0,2})a/','x','bac')

In this example we would expect the result 'xc'. However, the subpattern matches greedily to 'ba' and then never backtracks, so the match is missed. We could make the pattern ungreedy, but then we would get the result 'bxc', because it never backtracks after matching '' for the subpattern.
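
The three behaviours can be compared side by side. This is a small sketch using the same subject string; the variable names are only for illustration.

```php
<?php
// Ordinary group: the engine may backtrack from 'ba' to 'b',
// so 'ba' is matched and replaced.
$plain = preg_replace('/(?:[a-z]{0,2})a/', 'x', 'bac');   // 'xc'

// Once-only (atomic) group, greedy: it grabs 'ba' and cannot give
// anything back, so no match is found and the subject is unchanged.
$atomic = preg_replace('/(?>[a-z]{0,2})a/', 'x', 'bac');  // 'bac'

// Once-only group, ungreedy: it settles on '' and cannot grow,
// so only the bare 'a' is matched and replaced.
$lazy = preg_replace('/(?>[a-z]{0,2}?)a/', 'x', 'bac');   // 'bxc'
```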
