Regexp with negative lookahead and xhtml

Question

I have the following regular expression which performs a negative lookahead.

/\b(\w+)\b(?![^<]*)/gsmi

I want to match all text, including HTML, except the a, script, link and img elements. The problem occurs when an img tag is used.

An img tag has no closing tag, so the expression does not exclude it.

This is a sample text with a link and an image  and so on

The regular expression should not match the anchor (not even the text between its opening and closing tags), and it should not match the img.

I think I am almost there but I can't get it to work properly. This is what I've tried as well:

/\b(\w+)\b(?![^<]*)(?![^\)/gsmi

Somehow the last one only works on the img tag when there is no "i", "m" or "g" inside the tag. As soon as you add something like height=, it will not match.

Edit: The goal is to extract all words from the text except those inside anchor and image tags. The text might also contain no HTML at all.

user3942918 · Accepted Answer

I know you asked for a regex, but here is a solution using something that won't summon Cthulhu.

Example:

$html = <<<'HTML'
This is a sample text with
 a link and an image  and so on
HTML;

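// Parse the fragment without adding <html>/<body> wrappers or a doctype.
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Remove every <a>, <link>, <script> and <img> element, contents included.
foreach ($xpath->query('//a | //link | //script | //img') as $node) {
    $node->parentNode->removeChild($node);
}

// Print what is left of the markup.
echo $dom->saveHTML();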

Output:

This is a sample text 
 a link and an image  and so on

I recommend considering it as an option.
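
If the end goal is the word list described in the question's edit, the same DOM approach can probably be extended along these lines. This is only a sketch under a few assumptions: extractWords is a hypothetical helper name, the sample markup is made up for illustration, and the input is assumed to be ASCII (libxml may misread non-ASCII text unless the encoding is declared). It parses without the LIBXML_HTML_NOIMPLIED flag so that input containing no HTML at all still ends up under a single root element.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the words in
// $html that are not inside <a>, <link>, <script> or <img> elements.
function extractWords(string $html): array
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // No NOIMPLIED flag: libxml adds <html><body> wrappers, which is harmless
    // here and also gives plain text a root element.
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($dom->documentElement === null) {
        return [];
    }

    // Drop the unwanted elements together with their contents.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a | //link | //script | //img') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Only text is left to scan, so a simple word pattern is enough.
    preg_match_all('/\b\w+\b/', $dom->documentElement->textContent, $matches);

    return $matches[0];
}

// Illustrative input; the attribute values are invented.
$sample = 'This is a sample text with <a href="#">a link</a>'
    . ' and an image <img src="x.png" height="10"> and so on';

echo implode(' ', extractWords($sample)), PHP_EOL;
// prints: This is a sample text with and an image and so on
```

Because the regex only ever sees text nodes, attributes such as height= no longer matter, and input without any markup goes through the same code path.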
