Reputation: 11
I want to delete the linked hashtags (the ones wrapped in <a> tags).
I do not want to delete hashtags that are not links.
I do not want to delete other links.
For example:
<p><a href="/user/username" >Username</a> #filmphotography #vintage <a href="/tag/travelgram" >#travelgram</a> #montreux #royalpalacehotel <a href="/tag/switzerland">#switzerland</a> #selfie <a href="/tag/meandmysister">#meandmysister</a></p>
I want the result to be:
<p><a href="/user/username" >Username</a> #filmphotography #vintage #montreux #royalpalacehotel #selfie </p>
This code doesn't work:
$html = preg_replace('#<a(.*?)>#(.*?)</a>#is', '', $html);
Upvotes: 0
Views: 155
Reputation: 47992
As indicated by rollstuhlfahrer, you have made the folly of using an unescaped character (#) in the pattern that is also the pattern delimiter. The easiest way to solve this is to change the delimiters to a valid delimiting character that is not used in the pattern itself (e.g. ~).
Your new pattern will look like this: ~<a(.*?)>#(.*?)</a>~is
But there is more bad news...
Your output will be this:
<p> #montreux #royalpalacehotel #selfie </p>
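The regex engine is trying to make you happy and it does its best to find matches for you. The first match starts at the very first <a (the Username link), and the lazy (.*?) keeps extending until it finally finds >#. In doing so, it runs past your intended qualifying tags and gobbles up non-qualifying tags and text too!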
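To see the over-match for yourself, a minimal sketch like the following can be run. The variable names $pattern and $result are only for illustration; the pattern is the one from the question with ~ as the delimiter.

```php
<?php
$html = '<p><a href="/user/username" >Username</a> #filmphotography #vintage <a href="/tag/travelgram" >#travelgram</a> #montreux #royalpalacehotel <a href="/tag/switzerland">#switzerland</a> #selfie <a href="/tag/meandmysister">#meandmysister</a></p>';

// Same pattern as in the question, but delimited with ~ so the literal # is legal.
$pattern = '~<a(.*?)>#(.*?)</a>~is';

// The first match begins at the Username link and only ends after #travelgram,
// so the Username link and the first two unlinked hashtags are removed as well.
$result = preg_replace($pattern, '', $html);

echo $result;
```

The Username link, #filmphotography and #vintage all disappear, because they sit between the first <a and the first ># in the string.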
Here is the good news: DomDocument for the win!
Code:
$html = '<p><a href="/user/username" >Username</a> #filmphotography #vintage <a href="/tag/travelgram" >#travelgram</a> #montreux #royalpalacehotel <a href="/tag/switzerland">#switzerland</a> #selfie <a href="/tag/meandmysister">#meandmysister</a></p>';
$dom = new DOMDocument;
// The 2nd parameter stops libxml from adding a DOCTYPE and <html>/<body> wrappers.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$remove = [];  // initialise so the second loop is safe when nothing qualifies
// First loop: collect the <a> tags whose text starts with #.
foreach ($dom->getElementsByTagName('a') as $a) {
    if (strpos($a->nodeValue, '#') === 0) {
        $remove[] = $a;
    }
}
// Second loop: do the actual removal.
foreach ($remove as $bad_a) {
    $bad_a->parentNode->removeChild($bad_a);
}
echo $dom->saveHTML();
Output:
<p><a href="/user/username">Username</a> #filmphotography #vintage #montreux #royalpalacehotel #selfie </p>
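The trick (and this hung me up for a little while, until the solution found me here: http://php.net/manual/en/domnode.removechild.php#90292 ) is this:
You must use two loops to remove the tags. The first generates a list of tags to remove, then the second does the removal. The list returned by getElementsByTagName() is live, so removing nodes while iterating over it shifts the remaining items and some tags get skipped.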
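If you would rather keep a single loop, one option is to copy the live list into a plain array first. The following is only a sketch under that assumption: it uses iterator_to_array() to take the snapshot, with a shortened version of the sample HTML, and $result is an illustrative name.

```php
<?php
$html = '<p><a href="/user/username" >Username</a> #vintage <a href="/tag/travelgram" >#travelgram</a> #selfie <a href="/tag/meandmysister">#meandmysister</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// getElementsByTagName() returns a live list; copying it into an array
// gives a fixed snapshot that is safe to loop over while removing nodes.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    if (strpos($a->nodeValue, '#') === 0) {
        $a->parentNode->removeChild($a);
    }
}

$result = $dom->saveHTML();
echo $result;
```

This does the same job as the two loops: the snapshot plays the role of the $remove array.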
Upvotes: 1
Reputation: 19315
Short answer: use negated character classes instead of the lazy dot quantifier.
<a[^>]*>#[^<#]*<\/a>
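It is more efficient because the negated classes cannot run past the end of a tag, so there is very little backtracking and no unwanted over-matching.
The lazy quantifier (.*?) gives the shortest match from a given starting position, but when the engine has to keep expanding it, the overall match ends up bigger than intended because it started too early.
If unexpected matches still turn up, the regex may be refined further.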
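As a rough sketch of how the character-class pattern above could be applied with preg_replace(): the ~ delimiter (so the / needs no escaping), the i flag and the variable names are my own choices here, not part of the pattern itself.

```php
<?php
$html = '<p><a href="/user/username" >Username</a> #filmphotography #vintage <a href="/tag/travelgram" >#travelgram</a> #montreux #royalpalacehotel <a href="/tag/switzerland">#switzerland</a> #selfie <a href="/tag/meandmysister">#meandmysister</a></p>';

// [^>]* cannot leave the opening tag and [^<#]* cannot leave the link text,
// so only <a> tags whose text starts with # are matched.
$pattern = '~<a[^>]*>#[^<#]*</a>~i';

$result = preg_replace($pattern, '', $html);

echo $result;
```

The Username link and the unlinked hashtags are kept; only the linked hashtags are removed. The whitespace around each removed link stays in place.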
Upvotes: 1