Kristian Rafteseth

Reputation: 2032

preg_match pattern where a specific character should not come before another

I'm extracting image URLs from a website. For that purpose I'm using this regex:

preg_match_all('#"(http.*?jpg)"#', $html, $matches);

However, that gives a wrong result on lines like this one:

<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>

I cannot search for the <img> tag because some images come from JavaScript.

What is certain is that every image URL is enclosed in double quotes.

So what would solve my problem is changing the regex so that it does not allow any " character between "http" and "jpg".

Something like this, in pseudocode:

preg_match_all('#"(http.?:(anything except ")?jpg)"#', $html, $matches);

How do you do this?

Upvotes: 1

Views: 61

Answers (2)

anubhava

Reputation: 785008

You can use a negated character class in your regex to make sure that no " is matched between http and jpg:

preg_match_all('#"(http[^"]*jpg)"#i', $html, $matches); 
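
To see the difference on the sample line from the question, here is a minimal sketch that runs both patterns side by side. The variable names `$lazy` and `$strict` are only illustrative.

```php
<?php
// Sample line from the question: a link wrapping an image.
$html = '<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>';

// Lazy pattern from the question: it starts at the first "http and keeps
// going, across quotes, until it reaches the first jpg" it can find.
preg_match_all('#"(http.*?jpg)"#', $html, $lazy);

// Negated class: [^"]* cannot cross a double quote, so every match has to
// stay inside a single quoted string.
preg_match_all('#"(http[^"]*jpg)"#i', $html, $strict);

print_r($lazy[1]);   // one match that begins at the href URL and spans both attributes
print_r($strict[1]); // only http://omg.com/image.jpg
```

If URLs such as "http://omg.com/notjpg" are a concern, escaping the dot and matching `\.jpg` instead of `jpg` would be a reasonable tightening, assuming every wanted URL ends with that extension.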

A word of caution, though: parsing HTML with regex is not the best way to scrape web pages. You may want to consider using a DOM parser.
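
As a rough sketch of the DOM approach, assuming PHP's bundled DOM extension is available, something like the following collects the `src` of every `<img>` element. Note that it only sees real `<img>` tags, so it would not pick up the URLs that the question says are embedded in JavaScript; for those, the regex is still needed.

```php
<?php
// Sample line from the question.
$html = '<a href="http://omg.com/test.html"><img src="http://omg.com/image.jpg"></a>';

// Collect libxml warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$urls = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Keep only sources that end in .jpg (case-insensitive).
    if (preg_match('#\.jpg$#i', $src)) {
        $urls[] = $src;
    }
}

print_r($urls);
```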

Upvotes: 4

Avinash Raj

Reputation: 174696

You could try the regex below, which uses a negated character class.

"(http[^<>]*jpg)"

[^<>]* ensures that no < or > character appears between the http and jpg strings.
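
One thing worth checking is input without any angle brackets, such as the JavaScript sources mentioned in the question. The snippet below is a small sketch using a made-up script line (the `$js` string is an assumption, not taken from the question) to compare this class with a class that excludes the double quote.

```php
<?php
// Hypothetical JavaScript fragment with two quoted URLs on one line.
$js = 'var a = "http://omg.com/a.html", b = "http://omg.com/b.jpg";';

// [^<>]* may still run across double quotes, because only < and > are excluded.
preg_match_all('#"(http[^<>]*jpg)"#', $js, $angle);

// [^"]* stops at the closing quote of each string.
preg_match_all('#"(http[^"]*jpg)"#', $js, $quote);

print_r($angle[1]); // a single match that starts at a.html and ends at b.jpg
print_r($quote[1]); // only http://omg.com/b.jpg
```

So for the HTML line in the question both classes work, while for quoted strings inside scripts excluding " is the safer choice.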

Upvotes: 2
