matching html tag content in regex

Question

I know the rule is "don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.

So, here is the scenario:


    <script>
    some stuff
    </script>

    <script>
    var stuff = '<';
    anchortext
    </script>

If you do this:

    <script[^>]*?>.*?anchor.*?</script>

You will capture everything from the first script tag to the closing /script of the second block. Is there a way to do a .*? but with the . replaced by something that refuses to match a given string, something like:

    <script[^>]*?>(^</script>)*?anchor.*?</script>

I looked at negative lookaheads etc, but I can't get something to work properly. Usually I just use [^>]*? to avoid running past the closing block, but in this particular example, the script content has a "<" in it, and it stops matching on that before reaching the anchortext.

To simplify: I need something like [^z]*?, but instead of excluding a single character or character range, it should exclude a whole string.

.*?(?!z) doesn't have the same effect as [^z]*? as I assumed it would.
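A small sketch of why the two behave differently, using Python's `re` module. The sample string and the variable names are made up for illustration. In `.*?(?!z)` the lookahead is tested only once, at the position where the repetition stops, so the `.` is still free to walk over a `z` on the way there. A negated class, or a lookahead placed inside the repeated group, refuses every `z` along the way.

```python
import re

text = "abczdef"  # hypothetical sample: a "z" sits between "a" and "f"

# The lookahead is checked only where .*? stops, so the dot
# happily consumes the "z" on its way to the final "f".
plain = re.search(r"a.*?(?!z)f", text)

# A negated class can never consume "z", so there is no way to reach "f".
negated = re.search(r"a[^z]*?f", text)

# Lookahead inside the repeated group: tested before every character,
# which gives the same behaviour as the negated class.
tempered = re.search(r"a(?:(?!z).)*?f", text)

print(plain.group())   # abczdef
print(negated)         # None
print(tempered)        # None
```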

Here is where I am stuck: http://regexr.com?34llp

mario · Accepted Answer

Match-anything-but is indeed commonly implemented with a negative lookahead:

 ((?!exclude).)*?

The trick is not to repeat the . on its own. Instead, repeat a group that matches one character at a time, and check before each character that the excluded word does not begin at that position.

In your case you would want this in place of the initial .*?:

 <script[^>]*?>((?!</script>).)*?anchor.*?</script>
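To see the effect, here is a minimal sketch in Python, not a definitive implementation. It assumes the two blocks in the question are plain `<script>` ... `</script>` elements and that the dot is allowed to match newlines (`re.DOTALL` here; the flags used on regexr are not stated). The variable names are illustrative.

```python
import re

# Assumed reconstruction of the sample from the question.
html = """<script>
    some stuff
</script>

<script>
    var stuff = '<';
    anchortext
</script>"""

# Plain lazy dot: starts at the first <script> and runs across the
# first </script> until it finds "anchor" in the second block.
naive = re.compile(r"<script[^>]*?>.*?anchor.*?</script>", re.DOTALL)

# Lookahead inside the repeated group: each character is consumed only
# if "</script>" does not start at that position, so the match cannot
# leave the block it started in. A bare "<" is still allowed.
tempered = re.compile(
    r"<script[^>]*?>((?!</script>).)*?anchor.*?</script>", re.DOTALL
)

naive_match = naive.search(html).group()
tempered_match = tempered.search(html).group()

print(naive_match == html)   # True: both blocks were swallowed
print(tempered_match)        # only the second <script> block
```

The first block fails to match because "anchor" never appears before its `</script>`, so the engine moves on and matches the second block on its own, including the line with `'<'`.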
