Reputation: 13
I have the following regular expression, which uses a negative lookahead:
/\b(\w+)\b(?![^<]*</{0,1}(a|script|link|img)>)/gsmi
What I want to do is match all text, including HTML, except a, script, link and img elements. The problem occurs when an img tag is used.
An img tag has no closing tag, so the expression does not exclude it. For example:
<p>This is a sample text <a href="#">with</a> a link and an image <img src="" alt="" /> and so on</p>
The regular expression should not match the anchor (not even between the opening and closing tag) and it should not match the img.
I think I am almost there but I can't get it to work properly. This is what I've tried as well:
/\b(\w+)\b(?![^<]*</{0,1}(a|script|link)>)(?![^\<img]*>)/gsmi
Somehow the last one only works (on the img tag) when there is no "i", "m" or "g" inside the img tag. When you add something like height=, the exclusion stops working.
Edit: The goal is to extract all words from the text except those inside anchor and image tags. There is also a chance that the text contains no HTML at all.
Upvotes: 0
Views: 85
Reputation: 26413
I know you asked for a regex, but here is a solution using something that won't summon Cthulhu.
$html = <<<'HTML'
<p>This is a <em>sample</em> text <a href="#">with</a>
a link and an image <img src="" alt="" /> and so on</p>
HTML;
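$dom = new DOMDocument();
// The two flags stop libxml from adding <html>/<body> wrappers and a doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Select every unwanted element and detach it from its parent.
foreach ($xpath->query('//a | //link | //script | //img') as $node) {
    $node->parentNode->removeChild($node);
}
echo $dom->saveHTML();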
Output:
<p>This is a <em>sample</em> text
a link and an image and so on</p>
I recommend considering it as an option.
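Since the edit to the question says the real goal is a list of words, and that the input may contain no HTML at all, here is a rough sketch of how the same approach could be extended. The function name extractWords and the wrapping div are my own assumptions, not part of the code above. The wrapper is only there so that plain text without any markup still parses into a single root element.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the words
// that remain once <a>, <link>, <script> and <img> elements are removed.
function extractWords(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: wrap the input in a <div> so text with no tags still has a root.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Detach every unwanted element, including the text inside it.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a | //link | //script | //img') as $node) {
        $node->parentNode->removeChild($node);
    }

    // textContent joins the remaining text nodes; split that into words.
    preg_match_all('/\b\w+\b/u', $dom->documentElement->textContent, $matches);

    return $matches[0];
}

$sample = '<p>This is a sample text <a href="#">with</a> a link and an image '
        . '<img src="" alt="" /> and so on</p>';
print_r(extractWords($sample));
```

With the sample paragraph from the question, the word "with" inside the anchor should not appear in the result, and a string with no tags at all should simply come back split into its words.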
Upvotes: 0