Reputation: 656
I know "Don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.
So, here is the scenario:
<script...>
some stuff
</script>
<script...>
var stuff = '<';
anchortext
</script>
If you do this:
<script[^>]*?>.*?anchor.*?</script>
You will capture from the first script tag to the </script> in the second block. Is there a way to do a .*? but with the . replaced by a group that refuses to match a given string, something like:
<script[^>]*?>(^</script>)*?anchor.*?</script>
I looked at negative lookaheads etc., but I can't get anything to work properly. Usually I just use [^<]*? to avoid running past the closing tag, but in this particular example the script content has a "<" in it, and the match stops there before reaching the anchortext.
To simplify, I need something like [^z]*?, but instead of excluding a single character or character range, it should exclude a whole string.
.*?(?!z) doesn't have the same effect as [^z]*?, as I assumed it would.
Here is where I am stuck: http://regexr.com?34llp
Upvotes: 0
Views: 106
Reputation: 89547
Like this:
$pattern = '~<script[^>]*+>((?:[^<]+?|<++(?!/script>))*?\banchor(?:[^<]+?|<++(?!/script>))*+)</script>~';
But the DOM is by far the better way to do that.
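For comparison, here is a rough sketch of the parser route. It uses Python's standard-library html.parser rather than a PHP DOM extension, so it is a swapped-in illustration, not the method this answer has in mind. The class name ScriptCollector, the helper scripts_containing, and the sample markup (including the type attributes) are all made up for the example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self._chunks = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in more than one piece, so accumulate it.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def scripts_containing(html, needle):
    """Return the bodies of all <script> blocks whose text contains needle."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return [body for body in parser.scripts if needle in body]


# Sample input modelled on the question; the attributes are invented.
sample = (
    "<script type='a'>\nsome stuff\n</script>\n"
    "<script type='b'>\nvar stuff = '<';\nanchortext\n</script>\n"
)
hits = scripts_containing(sample, "anchor")
```

Because the parser treats script content as raw text, the stray "<" inside the second block should not confuse it, and only that block ends up in hits.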
Upvotes: 0
Reputation: 145482
Match-anything-but is indeed commonly implemented with a negative lookahead:
((?!exclude).)*?
The trick is to not repeat the . dot on its own. Instead, make it match one character at a time, checking each time that the character is not the beginning of the excluded word.
In your case you would want to have this instead of the initial .*?:
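A small sketch of why the position of the lookahead matters, using Python's re module purely for illustration (the thread does not depend on any particular engine). The sample string and variable names are invented. A lookahead placed after .*? is only checked once, at the spot where the repetition stops, whereas a lookahead inside the repeated group is checked before every character.

```python
import re

text = "abzcd"  # made-up sample: a 'z' sits between 'a' and 'd'

# Character class: the repetition can never step over a 'z'.
char_class = re.search(r"a[^z]*?d", text)

# Lookahead after the repetition: only tested where .*? stops,
# so the dot is still free to consume the 'z' along the way.
lookahead_after = re.search(r"a.*?(?!z)d", text)

# Lookahead inside the repetition: tested before every character,
# which reproduces the behaviour of [^z]*?.
tempered = re.search(r"a(?:(?!z).)*?d", text)

# The same construct with a multi-character exclusion: a lone 'z'
# is allowed, only the string "zz" would block the match.
tempered_string = re.search(r"a(?:(?!zz).)*?d", text)

print(char_class)                # None
print(lookahead_after.group(0))  # abzcd
print(tempered)                  # None
print(tempered_string.group(0))  # abzcd
```

The last pattern is the part a character class cannot express: the excluded thing is a whole string rather than a single character.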
<script[^>]*?>((?!</script>).)*?anchor.*?</script>
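To see the effect on input shaped like the question's, here is a quick check in Python (an arbitrary choice of engine; the sample markup and its type attributes are invented). It assumes the dot needs to match newlines, hence re.DOTALL, and it uses a non-capturing group (?:...) in place of the capturing one, which does not change what is matched.

```python
import re

# Sample input modelled on the question; the attributes are invented.
html = (
    "<script type='a'>\n"
    "some stuff\n"
    "</script>\n"
    "<script type='b'>\n"
    "var stuff = '<';\n"
    "anchortext\n"
    "</script>\n"
)

# Original attempt: .*? is free to run across the first </script>.
naive = re.compile(r"<script[^>]*?>.*?anchor.*?</script>", re.DOTALL)

# Lookahead version: each character is consumed only if it does not
# start a closing </script> tag.
tempered = re.compile(
    r"<script[^>]*?>(?:(?!</script>).)*?anchor.*?</script>", re.DOTALL
)

naive_match = naive.search(html).group(0)
tempered_match = tempered.search(html).group(0)

print(naive_match.startswith("<script type='a'>"))     # True: starts in block one
print(tempered_match.startswith("<script type='b'>"))  # True: only block two
```

The trailing .*?</script> can stay as it is, since a lazy match stops at the first closing tag after the anchor anyway.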
Upvotes: 3