tweak2

Reputation: 656

Matching HTML tag content with a regex

I know "Don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.

So, here is the scenario:

<script...>
    some stuff
</script>

<script...>
    var stuff = '<';
    anchortext
</script>

If you do this:

<script[^>]*?>.*?anchor.*?</script>

You will capture everything from the first script tag through the </script> of the second block. Is there a way to do a .*? but with the . replaced by something that refuses to match a whole string, something like:

<script[^>]*?>(^</script>)*?anchor.*?</script>

I looked at negative lookaheads and the like, but I can't get anything to work properly. Usually I just use [^<]*? to avoid running past the closing tag, but in this particular example the script content has a "<" in it, so the match stops there before reaching the anchortext.

To simplify: I need something like [^z]*?, but excluding a whole string rather than a single character or character range.

.*?(?!z) doesn't have the same effect as [^z]*?, contrary to what I assumed.

Here is where I am stuck: http://regexr.com?34llp

Upvotes: 0

Views: 106

Answers (2)

Casimir et Hippolyte

Reputation: 89547

Like this:

$pattern = '~<script[^>]*+>((?:[^<]+?|<++(?!/script>))*?\banchor(?:[^<]+?|<++(?!/script>))*+)</script>~';

But using the DOM is by far the better way to do that.
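For comparison, here is a rough sketch of the parser route in Python. It uses the standard library's event-based html.parser rather than a full DOM, and the names ScriptCollector and scripts_containing, as well as the type attributes in the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so concatenate.
        if self.in_script:
            self.scripts[-1] += data


def scripts_containing(html, needle):
    """Return the bodies of all script blocks whose text contains needle."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [body for body in collector.scripts if needle in body]


# Sample input modelled on the question; the attributes are invented.
html = """<script type="a">
    some stuff
</script>

<script type="b">
    var stuff = '<';
    anchortext
</script>"""

matches = scripts_containing(html, "anchor")
```

Because the parser treats script content as raw text, the stray "<" inside the second block is not a problem, and only that block should end up in matches.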

Upvotes: 0

mario

Reputation: 145482

Match-anything-but is indeed commonly implemented with a negative lookahead:

 ((?!exclude).)*?

The trick is not to repeat the bare . dot, but to have it match one character at a time while checking, at each position, that the excluded word does not begin there.

In your case, use this in place of the initial .*?:

 <script[^>]*?>((?!</script>).)*?anchor.*?</script>
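To see the difference in practice, here is a small Python sketch that runs the naive pattern from the question and the lookahead version side by side. The sample markup and its type attributes are invented stand-ins for the `<script...>` tags in the question, and the DOTALL flag is an assumption so that the dot can cross line breaks.

```python
import re

# Sample input modelled on the question; the attributes are invented.
html = """<script type="a">
    some stuff
</script>

<script type="b">
    var stuff = '<';
    anchortext
</script>"""

# Naive pattern: the lazy dot is free to run past the first </script>.
naive = re.compile(r"<script[^>]*?>.*?anchor.*?</script>", re.DOTALL)

# Lookahead version: a character is consumed only if </script>
# does not start at that position.
tempered = re.compile(
    r"<script[^>]*?>((?!</script>).)*?anchor.*?</script>", re.DOTALL
)

naive_match = naive.search(html).group(0)
tempered_match = tempered.search(html).group(0)

print(naive_match.startswith('<script type="a">'))     # starts in the first block
print(tempered_match.startswith('<script type="b">'))  # starts in the second block
```

The naive match begins at the first script tag, while the lookahead version fails there (it cannot step over the first </script>) and so only matches the second block. The lone "<" in the script body is harmless because the lookahead rejects only the full </script> string.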

Upvotes: 3
