james

Reputation: 792

Regex lookahead with multiple negative conditions

I am running a regex over an HTML string to fetch URLs. I want to fetch all hrefs and srcs that are not JavaScript files. From another SO post I have the following pattern:

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*"/

Which fetches me results like:

src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0"

This is good because it excludes the .js results. It's bad because it also captures the additional attributes in the element. I tried the following amendment to stop at the first ":

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)[^"]*"/

It works in that it returns href="$url", but it returns results ending in .js. Is there a way to combine a negative lookahead that says:

Thanks in advance for any help/tips/pointers.

Upvotes: 4

Views: 5704

Answers (4)

Sergiu Toarca

Reputation: 2749

Here's something a bit different. I used Debuggex with this expression:

(?:src|href)=(?&.quotStr)(?<!\.js")

which compiled it to this one:

$regex = '/(?:src|href)=(?:"((?:\\\\.|[^"\\\\]){0,})")(?<!\\.js")/';

Live Demo
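
To illustrate the lookbehind idea outside the demo, here is a minimal sketch in Python. Python's `re` module is an assumption on my part (the compiled pattern above is written as a PHP string), and the sample `html` plus the names `LOOKBEHIND` and `find_non_js` are made up for the example.

```python
import re

# Match a quoted src/href value, then look back from just after the
# closing quote and reject the match if the value ended in .js"
LOOKBEHIND = re.compile(r'(?:src|href)="((?:\\.|[^"\\])*)"(?<!\.js")')

def find_non_js(html):
    """Return the src/href values that do not end in .js."""
    return [m.group(1) for m in LOOKBEHIND.finditer(html)]

# Hypothetical sample input
html = (
    '<script src="http://www.mydomain.com/app.js"></script>'
    '<img src="http://www.mydomain.com/image.gif" alt="" border="0">'
)
print(find_non_js(html))  # ['http://www.mydomain.com/image.gif']
```

Because `[^"\\]` cannot step over a quote, the match always ends at the first closing quote, so the lookbehind only ever inspects the end of that one attribute value.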

Upvotes: 2

HammerNL

Reputation: 1841

Add a "?" after the "*" before the last quote. This makes the "*" non-greedy, i.e. it will stop matching at the first quote, not the last:

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*?"/
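
A quick way to see the difference is to run both variants against the same tag. This is only a sketch in Python's `re` (an assumption; the question does not say which engine is in use), and the names `GREEDY`, `LAZY` and `tag` are made up for the demo.

```python
import re

# Original pattern: greedy repetition runs on to the LAST quote in the tag
GREEDY = re.compile(r'(href|src)?="http://www\.mydomain\.com/(?:(?!\.js).)*"')
# Same pattern with *? : lazy repetition stops at the FIRST quote
LAZY = re.compile(r'(href|src)?="http://www\.mydomain\.com/(?:(?!\.js).)*?"')

tag = '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">'

print(GREEDY.search(tag).group(0))
# src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0"
print(LAZY.search(tag).group(0))
# src="http://www.mydomain.com/path/to/resource/image.gif"

# The lookahead still blocks .js, so a script URL does not match at all
print(LAZY.search('<script src="http://www.mydomain.com/app.js">'))  # None
```

One thing to keep in mind: `(?!\.js)` is tested at every character, so this version also rejects URLs that merely contain `.js` somewhere, such as a `.json` file.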

Upvotes: 4

Peter Alfvin

Reputation: 29389

If you only want to reject .js at the end of the string, you can use the following for the last part of the string match:

"(?![^"]*\.js").*?"

per this Rubular
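
For anyone who wants to try this without Rubular, here is a small sketch in Python (an assumption; Rubular itself runs Ruby). The `(?:href|src)=` prefix and the names `END_CHECK` and `is_match` are my own additions for the demo; only the quoted part comes from the snippet above.

```python
import re

# Directly after the opening quote, look ahead over the whole value and
# refuse to match if it ends in .js" ; otherwise match lazily to the quote
END_CHECK = re.compile(r'(?:href|src)="(?![^"]*\.js").*?"')

def is_match(attr):
    """True if the attribute text is accepted by the pattern."""
    return END_CHECK.search(attr) is not None

print(is_match('href="http://www.mydomain.com/style.css"'))  # True
print(is_match('src="http://www.mydomain.com/app.js"'))      # False
print(is_match('src="http://www.mydomain.com/data.json"'))   # True
```

Since the check is anchored to the closing quote, only a trailing `.js` is rejected; `.json` and paths that contain `.js` in the middle still get through.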

Upvotes: 1

james

Reputation: 792

EDIT

See: https://stackoverflow.com/a/18838123/1163653 for a better solution.

Fixed it:

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js"|").)*"/

Note that the lookahead is checked at every character after the domain: the match only keeps going while the next characters are neither .js" nor ". A URL ending in .js" therefore fails, while hrefs ending in .css get through because the repetition only stops when it reaches the first ", which is the behaviour needed.
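
For reference, a short sketch of the fixed pattern run over a few tags. It is written for Python's `re` (an assumption, the same pattern should behave alike in other PCRE-style engines), and the sample `html` and the names `FIXED` and `matches` are made up for the demo.

```python
import re

# Each character is consumed only if the text ahead is neither .js" nor "
FIXED = re.compile(r'(href|src)?="http://www\.mydomain\.com/(?:(?!\.js"|").)*"')

# Hypothetical sample input
html = (
    '<link href="http://www.mydomain.com/site.css" rel="stylesheet">'
    '<script src="http://www.mydomain.com/app.js"></script>'
    '<img src="http://www.mydomain.com/image.gif" alt="" border="0">'
)

matches = [m.group(0) for m in FIXED.finditer(html)]
print(matches)
# ['href="http://www.mydomain.com/site.css"',
#  'src="http://www.mydomain.com/image.gif"']
```

The script tag is skipped and each match ends at the first closing quote, so the trailing attributes are no longer captured.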

Upvotes: 0
