Reputation: 2032
I'm extracting image URLs from a website, and I'm using this regex to do it:
preg_match_all('#"(http.*?jpg)"#', $html, $matches);
However, that gives a wrong result on lines like this one, because the match starts at the first "http and runs through to the jpg in the next attribute:
<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>
I cannot search for the <img tag because some images come from JavaScript.
What is certain, though, is that every image URL is enclosed in double quotes.
So what would solve my problem is changing my regex to not allow any " characters between "http" and "jpg".
Something like this in pseudocode:
preg_match_all('#"(http.?:(anything except ")?jpg)"#', $html, $matches);
How do you do this?
Upvotes: 1
Views: 61
Reputation: 785008
You can use a negated character class in your regex to make sure it does not match a " between http and jpg:
preg_match_all('#"(http[^"]*jpg)"#i', $html, $matches);
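To make the difference concrete, here is a minimal sketch that runs both patterns against the sample line from the question. The variable names ($lazy, $negated) are only illustrative.

```php
<?php
// Sample markup taken from the question.
$html = '<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>';

// Lazy .*? is still allowed to cross a closing quote, so the match
// begins at the href value and runs on until the first jpg" it finds.
preg_match_all('#"(http.*?jpg)"#', $html, $lazy);

// [^"]* cannot cross a quote, so a match has to stay inside one quoted value.
preg_match_all('#"(http[^"]*jpg)"#i', $html, $negated);

// $lazy[1]    holds: http://omg.com/test.html"><img src="http://omg.com/image.jpg
// $negated[1] holds: http://omg.com/image.jpg
print_r($lazy[1]);
print_r($negated[1]);
```

The captured URLs are in index 1 of the matches array because the pattern has a single capturing group.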
As a word of caution, though: parsing HTML with regex is not the best way to scrape web pages. Consider using a DOM parser instead.
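For the images that do sit in real <img tags, a DOM-based approach could look roughly like the sketch below. It assumes PHP's bundled DOM extension is available; the helper name imgSrcsViaDom is hypothetical. It will not see URLs that only appear inside JavaScript, so for the question's case it would complement the regex rather than replace it.

```php
<?php
// Hypothetical helper: collect the src of every <img> whose URL ends in .jpg.
function imgSrcsViaDom(string $html): array
{
    $doc = new DOMDocument();

    // Scraped HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Keep only absolute http(s) URLs that end in .jpg.
        if (preg_match('#^http.*\.jpg$#i', $src)) {
            $urls[] = $src;
        }
    }
    return $urls;
}

$html = '<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>';
print_r(imgSrcsViaDom($html)); // the single image URL, without the href
```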
Upvotes: 4
Reputation: 174696
You could try the regex below, which uses a negated character class:
"(http[^<>]*jpg)"
[^<>]* ensures that no < or > symbol appears between the http and jpg strings.
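A quick sketch of how this pattern behaves, assuming # as the delimiter. The first input is the line from the question; the second is a made-up tag with two quoted URLs, included to show that this class only blocks tag boundaries and can still cross a quote inside a single tag.

```php
<?php
$pattern = '#"(http[^<>]*jpg)"#';

// Line from the question: the > between the two tags stops the match
// from starting at the href, so only the image URL is captured.
$separateTags = '<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>';
preg_match_all($pattern, $separateTags, $m1);

// Hypothetical input: two quoted URLs inside one tag. There is no < or >
// between them, so the capture spans both attributes.
$sameTag = '<img alt="http://omg.com/page" src="http://omg.com/image.jpg">';
preg_match_all($pattern, $sameTag, $m2);

// $m1[1] holds: http://omg.com/image.jpg
// $m2[1] holds: http://omg.com/page" src="http://omg.com/image.jpg
print_r($m1[1]);
print_r($m2[1]);
```

If inputs like the second one are possible, excluding the quote character as well, for example [^"<>]*, avoids the over-match.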
Upvotes: 2