antoni

Reputation: 5546

Match a group multiple times, but only inside a tag, in a single regex

Hi my question is simple:

I want to match all the hashtags in an article, but only those inside a <figcaption>, using a PCRE regex. E.g.:

<figcaption>blah blah #hashtag1, #hashtag2</figcaption>

I made an attempt here: https://regex101.com/r/aL9vS8/1. Removing the last ? changes the capture from #hashtag1 to #hashtag2, but I can't get both.

I am not even sure it is doable in one single regex in PHP.

Any idea to help me? :)

If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.

Thank you!

[EDIT]

If there is no way, my next idea in PHP is to:

  1. Match every figcaption with preg_replace_callback
  2. In the callback match every instance of #hashtag.

Can I get your opinions on this? Is there a better way? My articles are not very long.
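For reference, the two-step idea above could be sketched roughly as follows. This is only an illustration: the $article sample string, the $hashtags accumulator and the figcaption pattern are assumptions made for the example, and the pattern assumes captions are not nested.

```php
<?php
// Hypothetical sample input; $article stands in for the real article HTML.
$article = '<p>#outside</p><figcaption>blah blah #hashtag1, #hashtag2</figcaption>';

$hashtags = [];

// Step 1: visit every <figcaption>...</figcaption> block.
preg_replace_callback(
    '~<figcaption\b[^>]*>(.*?)</figcaption>~s',
    function (array $m) use (&$hashtags) {
        // Step 2: collect every #hashtag found inside this caption.
        if (preg_match_all('~#\w+~', $m[1], $tags)) {
            $hashtags = array_merge($hashtags, $tags[0]);
        }
        return $m[0]; // leave the caption itself unchanged
    },
    $article
);

print_r($hashtags); // contains #hashtag1 and #hashtag2, but not #outside
```

If the captions do not need to be modified, a plain preg_match_all over the captions followed by a second preg_match_all per caption would do the same job.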

Upvotes: 2

Views: 768

Answers (1)

Wiktor Stribiżew

Reputation: 626738

Please suggest the most efficient way possible performance wise

A reliable way to match multiple occurrences of some text between delimiters with a PCRE regex is to use custom boundaries built with the \G anchor. However, the trailing boundary here is a multicharacter string, so to match any text other than </figcaption> you'd need a tempered greedy token. Since that token is very resource consuming (its lookahead is checked before every single character), it should be unrolled.
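To make the "tempered greedy token" and its unrolled form concrete, here is a small sketch comparing the two on a made-up caption. The sample string and variable names are assumptions for illustration only; both patterns are expected to capture the same caption text, the unrolled one simply does less work per character.

```php
<?php
// Hypothetical sample input with a nested tag inside the caption.
$str = '<figcaption>a <b>bold</b> caption</figcaption> tail';

// Tempered greedy token: the lookahead is re-checked before every character.
$tempered = '~<figcaption>((?:(?!</figcaption>).)*)~s';

// Unrolled form: consume runs of non-"<" characters in one go and only
// test the lookahead when a "<" is actually encountered.
$unrolled = '~<figcaption>([^<]*(?:<(?!/figcaption>)[^<]*)*)~';

preg_match($tempered, $str, $m1);
preg_match($unrolled, $str, $m2);

var_dump($m1[1] === $m2[1]); // both capture the same caption text
echo $m2[1], PHP_EOL;
```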

Here is a fast, reliable PCRE regex for your task:

(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+

See the regex demo

Details:

  • (?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
    More details:
    (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)): it only groups and does not store what it matched. It matches either of 2 alternatives (| is the alternation operator): 1) the literal text <figcaption, or 2) (?!^)\G, the position right after the previous successful match. Note that \G also matches at the start of the string, so the negative lookahead (?!^) is needed to exclude that case.
  • [^<#]* - 0+ chars other than < and #
  • (?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
    • (?:<(?!\/figcaption>)|#\B) - a < not followed by /figcaption>, or a # not followed by a word char
    • [^<#]* - 0+ chars other than < and #
  • \K - omit the text matched so far
  • #\w+ - # and 1+ word chars
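The role of (?!^) in front of \G can be seen in a short sketch that runs the pattern with and without it. The sample string is made up for the example, and the $without variant is a deliberately broken copy used only for comparison.

```php
<?php
// Hypothetical input: one hashtag before any caption, two inside, one after.
$str = '#intro text <figcaption>blah #hashtag1, #hashtag2</figcaption> #outside';

// Pattern from the answer.
$with = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';

// Same pattern without (?!^): \G now also matches at the very start of the string.
$without = '~(?:<figcaption|\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';

preg_match_all($with, $str, $a);
preg_match_all($without, $str, $b);

print_r($a[0]); // only the hashtags inside the caption
print_r($b[0]); // also picks up #intro, because the chain starts at offset 0
```

In both cases #outside is skipped, since the chain of \G matches is broken at </figcaption>.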

Even more details:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:

foo\Kbar

matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
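A quick way to check the \K behavior quoted above in PHP (the input string is just an illustrative assumption):

```php
<?php
$input = 'xx foobar yy';

// foo must be present, but only bar is reported as the match.
preg_match('~foo\Kbar~', $input, $m);
echo $m[0], PHP_EOL; // bar

// In a replacement, only the part after \K is replaced.
$replaced = preg_replace('~foo\Kbar~', 'BAZ', $input);
echo $replaced, PHP_EOL; // xx fooBAZ yy
```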

  • (?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, the outer non-capturing group (?:...)* lets a sequence of subpatterns be matched zero or more times (a quantifier such as * applies to a whole sequence only when that sequence is grouped). The inner non-capturing group (?:<(?!\/figcaption>)|#\B) is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]*: it groups the 2 alternatives <(?!\/figcaption>) and #\B before their common "suffix" [^<#]*.
  • Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:

Code:

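// Single quotes keep the backslashes and $0 literal; ~ is the regex delimiter.
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';
// $0 is the whole match, i.e. only the #hashtag that follows \K.
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;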

See the PHP IDEONE demo

Upvotes: 2
