Reputation: 5546
Hi, my question is simple:
I want to match all the possible hashtags in an article with a PCRE regex, but only if they are inside a <figcaption>. E.g.:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1 and removing the last ? would change the capture from #hashtag1 to #hashtag2, but I can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way to do it in one single regex (really? not even with recursion, (?R)? :p), please suggest the most efficient way possible, performance-wise.
Thank you!
[EDIT]
If there is no way, my next idea in PHP is to use preg_replace_callback and handle each #hashtag in the callback. Can I get your opinions on this? Is there a better way? My articles are not very long.
Upvotes: 2
Views: 768
Reputation: 626738
Please suggest the most efficient way possible performance wise
The most reliable way to match some text in between some delimiters with a PCRE regex is to use custom boundaries with the \G anchor. However, the trailing boundary is a multicharacter string, and to match any text but </figcaption> you'd need a tempered greedy token. Since this token is very resource-consuming, it should be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
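Since the question asks for matching rather than replacing, here is a minimal sketch of the same pattern used with preg_match_all. The sample $article string is made up for illustration and is not from the question.

```php
<?php
// Pattern from the answer; "~" is used as the delimiter.
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';

// Hypothetical sample: hashtags outside <figcaption> should be ignored.
$article = '<p>#outside</p><figcaption>blah #hashtag1, #hashtag2</figcaption><p>#after</p>';

// $matches[0] holds every full match, i.e. only the text after \K.
preg_match_all($re, $article, $matches);
$hashtags = $matches[0];

print_r($hashtags);
```

With this input, $hashtags should contain only #hashtag1 and #hashtag2.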
Details:
(?:<figcaption|(?!^)\G) - matches <figcaption or the end of the previous successful match. (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that is meant to only group, not keep track of what was matched with this group (i.e. no value is kept for it in the group stack). It matches 2 alternatives (| is the alternation operator): 1) the literal text <figcaption, or 2) (?!^)\G, the location after the previous successful match (note that \G also matches the start of the string, thus we must add the negative lookahead (?!^) to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
  (?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption>, or a # not followed with a word char
  [^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars

Even more details:
\K: The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern foo\Kbar matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, the outer non-capturing group (?:...)* enables matching a sequence of subpatterns zero or more times (a quantifier such as * can repeat a sequence of subpatterns only if that sequence is grouped), and the inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]* (it groups the 2 different alternatives, <(?!\/figcaption>) and #\B, before the common "suffix" [^<#]*).
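The two less common constructs, \K and (?!^)\G, can be tried in isolation. This is a small sketch on toy strings (the inputs and the x marker are made up for illustration): the second pattern mirrors the structure of the hashtag regex, with x playing the role of <figcaption and single digits playing the role of hashtags.

```php
<?php
// \K: "foo" must be present, but only "bar" is reported as the match.
preg_match('~foo\Kbar~', 'foobar', $m);
echo $m[0], "\n";

// (?:x|(?!^)\G): a match may start at an "x" or right where the previous
// match ended, so only digits in an unbroken run after "x" are matched.
// The leading "12" is skipped because \G at the string start is excluded.
preg_match_all('~(?:x|(?!^)\G)\K\d~', '12x34', $digits);
print_r($digits[0]);
```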
Use this regex with preg_replace and the <span class="highlight">$0</span> replacement pattern:
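// Pattern from above, with "~" as the delimiter
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
// Sample input: "#ee" is outside any figcaption and must stay untouched
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';
// $0 refers to the whole match, i.e. the hashtag itself
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;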
See the PHP IDEONE demo
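For comparison, the two-step idea from the question's edit could look roughly like this. It is only a sketch under assumptions: the figcaption pattern and the sample $html are made up here, and the inner #\w+ would also hit a # inside attributes of nested tags (for example href="#top"), so it suits plain-text captions best.

```php
<?php
// Hypothetical sample input; "#ee" sits outside the figcaption.
$html = '<figcaption>blah #hashtag1, #hashtag2</figcaption> #ee';

// Step 1: find each <figcaption>...</figcaption> block (lazy match, "s" lets . span newlines).
// Step 2: inside the callback, wrap every hashtag found in that block.
$result = preg_replace_callback(
    '~<figcaption\b[^>]*>.*?</figcaption>~s',
    function (array $m): string {
        return preg_replace('~#\w+~', '<span class="highlight">$0</span>', $m[0]);
    },
    $html
);

echo $result;
```

This is easier to read than the single regex, at the cost of a second regex call per figcaption, which should hardly matter for short articles.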
Upvotes: 2