Jake Downs

Reputation: 11

How to ignore regex matches wrapped by a particular string?

I had a great idea for some functionality on a project and I've tried to implement it to the best of my ability but I need a little help achieving the desired effect. The page in question is: http://dev.favorcollective.com/guidelines/ (just to provide some context)

I'm using PHP's preg_replace to go through a particular page's contents (one giant string), search for glossary terms, and wrap each term in a bit of HTML that enables dynamic glossary-definition tooltips.

Here is my current code:

function annotate($content)
{
    global $glossary_terms;
    $search =  array();
    $replace = array();
    $count=1;

    foreach ($glossary_terms as $term):
        array_push($search,'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i');
        $id = "annotation-".$count;
        $replacement = '<a href="'.get_bloginfo('url').'/glossary#'.preg_replace( '/\s+/', '', $term['term']).'" class="annotation" rel="'.$id.'">'.$term['term'].'</a><span id="'.$id.'" style="display:none;"><span class="term">'.$term['term'].'</span><span class="definition">'.$term['def'].'</span></span>';
        array_push($replace,(string)$replacement);

        $count++;

    endforeach;

    return preg_replace($search, $replace, $content);
}

• But what if I want to ignore matches inside of <h#> </h#> tags?

• I also have a particular string within which I do not want a specific term to match. For example, I want the word "proficiency" to match any time it is NOT used in the context of "ACTFL Proficiency Guidelines". How would I go about adding exceptions to my regular expression? Is that even an option?

• Finally, how can I return the matched text as a variable? Currently, when I match a term ending in 's' or 'ing' (on purpose), my script prints the glossary term rather than the text that was actually matched (i.e. it's replacing "descriptions" with "description"). Is there any way to keep the matched text?

Upvotes: 1

Views: 528

Answers (2)

Scott Weaver

Reputation: 7361

I'm not a PHP guy (C#), but here goes. I assume that:

'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i' will map to this far more readable pattern:

/\b(ESCAPED_TERM)[?=a-zA-Z]*/i

As for excluding <h#> type tags, regex is OK only if you can assume your data is the simple, non-nested case: <h#>TERM</h#>. If you can, you can use a negative lookahead assertion for the closing tag, placed after the suffix match (the possessive *+ keeps the engine from backtracking out of the suffix to get around the lookahead):

/\b(ESCAPED_TERM)[?=a-zA-Z]*+(?!<\/h\d>)/i

You can combine a lookahead with a lookbehind to handle your special case:

/\b(ESCAPED_TERM|(?<!ACTFL )Proficiency(?!\sGuidelines))[?=a-zA-Z]*+(?!<\/h\d>)/i

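Note: if you have a bunch of these special cases, PCRE's x (extended) modifier makes the engine ignore unescaped whitespace in the pattern, which lets you put each token on its own line. With that modifier, a literal space has to be written as \s or as an escaped space.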
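A minimal PHP sketch of the lookaround idea, written with the x modifier so each token sits on its own line. The sample term, the sample sentence and the shortened replacement markup are assumptions for illustration, and the heading check only covers the simple <h#>TERM</h#> case. Using $0 in the replacement keeps the text exactly as it was matched.

```php
<?php
// Illustrative values only: the term "proficiency" and the sample text are assumptions.
$pattern = '/
    \b                      # word boundary before the term
    (?<!ACTFL\ )            # not preceded by "ACTFL " (space escaped because of the x modifier)
    proficiency
    (?!\ Guidelines)        # not followed by " Guidelines"
    [a-zA-Z]*+              # optional suffix, matched possessively
    (?!<\/h\d>)             # not directly followed by a closing heading tag
/ix';

$text   = 'The ACTFL Proficiency Guidelines describe proficiency levels.';
$result = preg_replace($pattern, '<a class="annotation">$0</a>', $text);
echo $result, "\n";

// A term that fills a whole heading is left alone.
$heading = preg_replace($pattern, '<a class="annotation">$0</a>', '<h2>Proficiency</h2>');
echo $heading, "\n";
```

Comments inside an x-mode pattern must not contain the delimiter character, since PHP looks for the closing delimiter before PCRE ever sees the comment.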

Upvotes: 3

ghoti

Reputation: 46886

Regular expressions are awesome, wonderful, magical. But everything has its limits.

That's why it's nice to have a language like PHP to provide the extra functionality. :)

Can you strip out headers with a non-greedy regexp?

$content = preg_replace('/<h[1-6]>.*?<\/h[1-6]>/sim', "", $content);

If non-greedy evaluations aren't working, what about just assuming that there won't be any other HTML inside your headers?

$content = preg_replace('/<h[1-6]>[^<]*<\/h[1-6]>/im', "", $content);
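If the headers should stay in the output rather than be stripped, one alternative is to split the content on headers and annotate only the pieces in between. This is a sketch under assumptions: annotate_outside_headings() and the $annotate callback are hypothetical names, and it still assumes headers are not nested.

```php
<?php
// Hypothetical helper: annotate_outside_headings() and $annotate are illustrative names.
function annotate_outside_headings(string $content, callable $annotate): string
{
    // The capturing group makes preg_split() return the headers too,
    // so they end up at the odd indexes of $parts.
    $parts = preg_split(
        '/(<h[1-6][^>]*>.*?<\/h[1-6]>)/is',
        $content,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: text between headers, safe to annotate.
            $parts[$i] = $annotate($part);
        }
    }

    return implode('', $parts);
}

$html = '<h2>Proficiency levels</h2><p>Each proficiency level is described.</p>';
$out  = annotate_outside_headings($html, function (string $text): string {
    return preg_replace('/\bproficiency[a-zA-Z]*/i', '<a class="annotation">$0</a>', $text);
});
echo $out, "\n";
```

In the question's code, the callback could simply be the existing annotate() function.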

Also, you might want to use sprintf to simplify your replacement:

/*
  1  get_bloginfo('url')
  2  preg_replace( '/\s+/', '', $term['term'])
  3  $id
  4  $term['term']
  5  $term['def']
*/
$rfmt = '<a href="%1$s/glossary#%2$s" class="annotation" rel="%3$s">%4$s</a><span id="%3$s" style="display:none;"><span class="term">%4$s</span><span class="definition">%5$s</span></span>';

...

$replacement = sprintf($rfmt, get_bloginfo('url'), preg_replace( '/\s+/', '', $term['term']), $id, $term['term'], $term['def'] );
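To keep the text that was actually matched (so "descriptions" stays "descriptions"), one option is to build the replacement inside preg_replace_callback, where $m[0] holds the full match. The sketch below reuses the layout of $rfmt with a sixth placeholder for the link text. $base_url stands in for get_bloginfo('url'), and the term, definition and sample sentence are made-up values.

```php
<?php
// Assumed sample values: $term, $id and $base_url stand in for the glossary
// entry, the counter-based id and get_bloginfo('url').
$term     = ['term' => 'description', 'def' => 'A spoken or written account.'];
$id       = 'annotation-1';
$base_url = 'http://example.com';

// Same layout as the format string above, except that the link text is
// a separate placeholder (%6$s) filled with the matched text.
$rfmt = '<a href="%1$s/glossary#%2$s" class="annotation" rel="%3$s">%6$s</a>'
      . '<span id="%3$s" style="display:none;"><span class="term">%4$s</span>'
      . '<span class="definition">%5$s</span></span>';

$pattern = '/\b' . preg_quote($term['term'], '/') . '[a-zA-Z]*/i';

$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($rfmt, $base_url, $term, $id): string {
        // $m[0] is the text exactly as it appeared in the content.
        return sprintf(
            $rfmt,
            $base_url,
            preg_replace('/\s+/', '', $term['term']),
            $id,
            $term['term'],
            $term['def'],
            $m[0]
        );
    },
    'Read the descriptions first.'
);
echo $result, "\n";
```

With plain preg_replace, putting $0 in the replacement string has the same effect, but then any $ or backslash in a definition would need escaping.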

Upvotes: 0
