preg_match pattern where a specific character should not come before another

Question

I'm extracting image URLs from a website. For that purpose I'm using this regex:

preg_match_all('#"(http.*?jpg)"#', $html, $matches);

However that will give a wrong result on lines like these:

I cannot search by tag because some images come from JavaScript.



What is certain is that every image URL is enclosed in a pair of double quotes ("").

So what would solve my problem is to change my regex to not allow any " characters between "http" and "jpg"

Something like this in pseudocode

preg_match_all('#"(http.?:(anything except ")?jpg)"#', $html, $matches);


How do you do this?

anubhava · Accepted Answer

You can use a negated character class, [^"], in your regex to make sure no " is matched between http and jpg:

preg_match_all('#"(http[^"]*jpg)"#i', $html, $matches);
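To see why this helps, here is a minimal sketch comparing the two patterns. The sample markup in $html is made up for illustration and is not the asker's actual page; it only assumes a quoted non-image URL appearing before a quoted image URL.

```php
<?php
// Hypothetical sample input: a quoted non-image URL followed by a quoted image URL.
$html = '<a href="http://example.com/page.html"><img src="http://example.com/photo.jpg"></a>';

// Lazy .*? can still cross quote boundaries: the match starts at the first
// "http and keeps growing until it reaches the first jpg" it can find.
preg_match_all('#"(http.*?jpg)"#', $html, $lazy);

// [^"]* cannot consume a quote, so each match stays inside one quoted string.
preg_match_all('#"(http[^"]*jpg)"#i', $html, $strict);

print_r($lazy[1]);   // one overlong match spanning both attributes
print_r($strict[1]); // only http://example.com/photo.jpg
```

The lazy quantifier only minimizes the length of a match from a given starting point. It does not stop the match from starting too early, which is why the negated class is needed.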


A word of caution, though: parsing HTML with regex is not the best way to scrape web pages. You may want to consider using a DOM parser instead.
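As a rough sketch of the DOM approach, PHP's built-in DOMDocument can collect the src attributes of img elements. The sample markup is hypothetical, and this only finds images present in the static HTML; URLs that appear only inside JavaScript, as the question mentions, would still need the regex or another method.

```php
<?php
// Hypothetical sample input, made up for illustration.
$html = '<html><body>'
      . '<img src="http://example.com/a.jpg">'
      . '<img src="http://example.com/b.png">'
      . '</body></html>';

// Collect libxml warnings internally instead of printing them,
// since real-world markup is often not perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$urls = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Keep only URLs that start with http and end with jpg (case-insensitive).
    if (preg_match('#^http.*jpg$#i', $src)) {
        $urls[] = $src;
    }
}

print_r($urls); // only the .jpg URL remains
```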
