Reputation: 792
I am running a regex over an HTML string to fetch URLs. I want to fetch all hrefs and srcs that are not JavaScript. From another SO post I have the following pattern:
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*"/
Which fetches me results like:
src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0"
This is good because it excludes the .js results. It's bad because it also captures the other attributes in the element. I tried the following amendment to stop at the first ":
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)[^"]*"/
It works in that it returns href="$url", but it also returns results ending in .js. Is there a way to combine a negative lookahead that says: match up to the first " - i.e. [^"]*; and reject anything ending in .js"?
Thanks in advance for any help/tips/pointers.
Upvotes: 4
Views: 5704
Reputation: 2749
Here's something a bit different. I used Debuggex with this expression:
(?:src|href)=(?&.quotStr)(?<!\.js")
which compiled it to this one:
$regex = '/(?:src|href)=(?:"((?:\\\\.|[^"\\\\]){0,})")(?<!\\.js")/';
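For anyone who wants to try the lookbehind idea outside PHP, here is a minimal sketch in Python. It is a simplified port rather than the exact compiled expression above, and the sample markup is made up for illustration. The trick is to match the whole quoted value first, then look back over the last four characters and reject the match if they are .js".

```python
import re

# Hypothetical sample markup, made up for this illustration.
html = (
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">'
    '<script src="http://www.mydomain.com/js/app.js"></script>'
    '<link href="http://www.mydomain.com/css/site.css" rel="stylesheet">'
)

# Match a quoted attribute value (allowing backslash escapes), then use a
# fixed-width negative lookbehind to reject values that end in .js"
pattern = re.compile(r'(?:src|href)="((?:\\.|[^"\\])*)"(?<!\.js")')

# findall returns the contents of the single capture group: the URL.
urls = pattern.findall(html)
print(urls)
```

With this sample input the script URL is dropped and the .gif and .css URLs are kept. Note that this version does not restrict the domain, as the expression above does not either.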
Upvotes: 2
Reputation: 1841
Add a "?" after the "*" before the last quote. This makes the "*" non-greedy, i.e. it will stop matching at the first quote, not the last:
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*?"/
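A quick way to check the non-greedy version is the Python sketch below. The sample markup is hypothetical, and the pattern is adapted slightly (the attribute name is required and the URL is captured). One thing it shows: because the lookahead tests for .js at every position, a URL that merely contains .js, such as one ending in .json, is rejected as well.

```python
import re

# Hypothetical sample markup, made up for this illustration.
html = (
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">'
    '<script src="http://www.mydomain.com/js/app.js"></script>'
    '<link href="http://www.mydomain.com/css/site.css" rel="stylesheet">'
    '<a href="http://www.mydomain.com/api/data.json">data</a>'
)

# The lazy quantifier *? stops at the first closing quote, and the
# lookahead (?!\.js) refuses to step over ".js" anywhere in the URL.
pattern = re.compile(r'(?:href|src)="(http://www\.mydomain\.com/(?:(?!\.js).)*?)"')

urls = pattern.findall(html)
print(urls)
```

Here the match no longer runs on into alt="" and border="0", and both app.js and data.json are excluded. If .json links should be kept, the lookahead needs to be anchored to the closing quote.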
Upvotes: 4
Reputation: 29389
If you only want to reject .js at the end of the string, you can use the following for the last part of the string match:
"(?![^"]*\.js").*?"
per this Rubular
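As a rough check of the same idea in Python (sample markup is hypothetical, and the href/src prefix and capture group are added for the demo): the lookahead runs once, right after the opening quote, and scans ahead to see whether the value ends in .js" before anything is consumed.

```python
import re

# Hypothetical sample markup, made up for this illustration.
html = (
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">'
    '<script src="http://www.mydomain.com/js/app.js"></script>'
    '<link href="http://www.mydomain.com/css/site.css" rel="stylesheet">'
    '<a href="http://www.mydomain.com/api/data.json">data</a>'
)

# (?![^"]*\.js") fails only when the quoted value ends in .js; otherwise
# the lazy .*? captures everything up to the first closing quote.
pattern = re.compile(r'(?:href|src)="(?![^"]*\.js")(.*?)"')

urls = pattern.findall(html)
print(urls)
```

Only app.js is dropped with this input. The .json link survives because the check is tied to the closing quote rather than applied at every character.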
Upvotes: 1
Reputation: 792
EDIT
See: https://stackoverflow.com/a/18838123/1163653 for a better solution.
Fixed it:
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js"|").)*"/
Note that the lookahead stops the match at the first .js" or ", whichever comes first. A URL ending in .js is rejected because the closing " can never be matched right after the point where it stops. Hrefs ending in .css are allowed through because they only stop at the first ", which is the behaviour needed.
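Here is a small Python sketch of that pattern against some made-up markup, in case it helps to see it run. The only changes from the pattern above are that the attribute name is required and a capture group is added around the URL.

```python
import re

# Hypothetical sample markup, made up for this illustration.
html = (
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">'
    '<script src="http://www.mydomain.com/js/app.js"></script>'
    '<link href="http://www.mydomain.com/css/site.css" rel="stylesheet">'
)

# Each character is consumed only if it does not start .js" and is not a
# quote, so the repetition halts at whichever of the two comes first.
pattern = re.compile(r'(?:href|src)="(http://www\.mydomain\.com/(?:(?!\.js"|").)*)"')

urls = pattern.findall(html)
print(urls)
```

For the script tag the repetition halts just before .js", the next character is a dot rather than a quote, and the match fails, so only the .gif and .css URLs come back.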
Upvotes: 0