Eric
Eric

Reputation: 2116

Why does this regular expression work?

OK I'm thoroughly on why this regular expression works. The text I'm working with is this:

<html>
  <body>
    hello
    <img src="withalt" alt="hi"/>asdf
    <img src="noalt" />fdsa<a href="asdf">asdf</a>
    <img src="withalt2" alt="blah" />
  </body>
</html>

Using the following regular expression (tested in php but I'm assuming it's true for all perl regular expressions), it will return all img tags which do not contain an alt tag:

/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />

So based on that I would think that simply removing the no backreference would return the same:

/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />

As you see instead it just returns all image tags. Then to make things even more confusing, removing the ? (simply a wildcard as far as I'm aware) after the * returns up to the final >

/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsa<a href="asdf">asdf</a>
<img src="withalt2" alt="blah" />

So anyone care to inform me, or at least point me in the right direction of what's going on here?

Upvotes: 3

Views: 85

Answers (1)

Rohit Jain
Rohit Jain

Reputation: 213261

/<img(?:(?!alt=).)*?>/

This regex applies negative look-ahead for each character it matches after img. So, as soon as it finds alt=, it stops. So, it will only match the img tag, that does not have an alt attribute.

/<img(?!alt=).*?>/

This regex, just applies the negative look-ahead after img. So, it will match everything till the first > for all the img tag which is not followed by alt=, no matter whether alt= appears anywhere further down the string. It will be covered in .*?

/<img(?!alt=).*>/

This is same as the previous one, but it matches everything till the last >, since it uses greedy matching. But I don't know why you got that output. You should have got everything till the last > for </html>.


Now forget everything that happened there, and move towards an HTML Parser, for parsing an HTML. They are specifically designed for this task. So, don't bother using regex, because you can't parse every kind of HTML's through regex.

Upvotes: 2

Related Questions