Why does this regular expression work?

Question

OK, I'm thoroughly confused about why this regular expression works. The text I'm working with is this:


  
    hello
    asdf
    fdsaasdf

Using the following regular expression (tested in PHP, but I'm assuming it's true for all Perl-compatible regular expressions), it returns all img tags which do not contain an alt attribute:

//
Returns:

So based on that, I would think that simply removing the non-capturing group (the "no backreference" part) would return the same:

//
Returns:

As you can see, it instead just returns all image tags. Then, to make things even more confusing, removing the ? (simply a wildcard as far as I'm aware) after the * returns everything up to the final >.

//
Returns:

fdsaasdf

So anyone care to inform me, or at least point me in the right direction of what's going on here?

Rohit Jain · Accepted Answer

//

This regex applies the negative look-ahead before each character it matches after img. As soon as it reaches alt= before the closing >, the look-ahead fails and the tag is rejected. So it will only match an img tag that does not have an alt attribute.
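A minimal sketch of that behaviour in Python, whose re module treats these constructs the same way PCRE does. The pattern and the sample markup are assumptions reconstructed from the description, since the originals did not survive in the post:

```python
import re

# Hypothetical sample input, not the asker's original markup.
html = '<img src="a.png" alt="hello"><img src="b.png"><p>text</p>'

# Assumed form of the first pattern: (?!alt=) is re-checked before
# every single character consumed between "<img" and ">".
tempered = re.compile(r'<img(?:(?!alt=).)*?>')

# The first tag is rejected because "alt=" is reached before ">".
tempered_matches = tempered.findall(html)
print(tempered_matches)  # ['<img src="b.png">']
```

Wrapping the look-ahead and the dot together in one repeated group is what makes the check run at every position rather than once.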

//

This regex applies the negative look-ahead only once, immediately after img. So it matches everything up to the first > for every img tag in which img is not directly followed by alt=, no matter whether alt= appears further along in the string. Any later alt= is simply consumed by .*?
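To illustrate, here is a small Python sketch with an assumed version of this second pattern and made-up input. The look-ahead only guards the position right after img, where a space normally follows, so it almost never rejects anything:

```python
import re

# Hypothetical sample input, not the asker's original markup.
html = '<img src="a.png" alt="hello"><img src="b.png"><p>text</p>'

# Assumed form of the second pattern: (?!alt=) is checked once,
# directly after "<img", and then .*? lazily runs to the first ">".
single_check = re.compile(r'<img(?!alt=).*?>')

# Both tags match, including the one that has an alt attribute.
single_matches = single_check.findall(html)
print(single_matches)

# The only thing rejected is "alt=" glued directly onto "<img".
glued = single_check.search('<imgalt="x">')
print(glued)  # None
```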

//

This is the same as the previous one, but it matches everything up to the last >, since it uses greedy matching. But I don't know why you got that output. You should have got everything till the last > for .
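A short Python sketch of the greedy variant, again with an assumed pattern and hypothetical input. It also shows that the dot does not cross newlines by default, so on multi-line input "the last >" means the last one on the same line:

```python
import re

# Assumed form of the third pattern: same single look-ahead, greedy .*
greedy = re.compile(r'<img(?!alt=).*>')

# Hypothetical single-line input: one match from the first "<img"
# all the way to the final ">" of the string.
one_line = '<img src="a.png" alt="hello"><img src="b.png"><p>text</p>'
greedy_matches = greedy.findall(one_line)
print(greedy_matches)

# Hypothetical two-line input: "." stops at "\n" unless re.DOTALL is set,
# so each line yields its own match.
two_lines = '<img src="a.png">\n<img src="b.png" alt="x">'
per_line = greedy.findall(two_lines)
print(per_line)
```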

Now forget everything that happened there, and move to an HTML parser for parsing HTML. Parsers are specifically designed for this task. Don't bother using regex, because you can't reliably parse every kind of HTML with regex.
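As one possible sketch of the parser approach, the following uses Python's standard-library html.parser (the question used PHP, so treat this as an illustration rather than the exact tool to use). The class name ImgWithoutAlt and the sample markup are made up for the example:

```python
from html.parser import HTMLParser


class ImgWithoutAlt(HTMLParser):
    """Collects the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        # Self-closing tags like <img ... /> also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src"))


# Hypothetical sample input, not the asker's original markup.
html = ('<img src="a.png" alt="hello"><img src="b.png">'
        '<img src="c.png" data-alt="x"/>')

parser = ImgWithoutAlt()
parser.feed(html)
parser.close()
print(parser.missing)  # ['b.png', 'c.png']
```

Because the parser works on attribute names rather than raw text, an attribute such as data-alt is not mistaken for alt, which a plain textual check for alt= would get wrong.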
