Reputation: 63687

Inaccurate preg_match with '.jpg' pattern

I am using preg_match with the pattern $pattern = '/src="http:\/\/(.*?).jpg"/s'; to grab URLs of JPEG images off a webpage. However, this is not accurate enough, as it also grabs http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg.

Other times, it grabs stuff like

http://maps.google.com/maps/api/staticmap?center=42.34,-71.18&path=weight:4|42.338,-71.177|42.338,-71.183|42.342,-71.183|42.342,-71.177|42.338,-71.177&zoom=15&size=335x225&sensor=false" width="280" height="188" alt=""></td></tr> <tr><td height="10"></td></tr></table></td></tr></table></td></tr><tr><td height="10 valign="> </td></tr><tr><td valign="top" background="http://www.coolapartments.info/img/java-footer_bg.jpg

How can I improve the pattern to prevent unwanted matching like the 2 examples above?

Upvotes: 0

Answers (2)

Gordon

Reputation: 317119

Use DOM and this XPath

//@src[contains(., '.jpg')]

to match all src attributes of elements that contain the string ".jpg" somewhere.

If the attribute should end in ".jpg" use

//@src[substring(., string-length(.) - 3) = '.jpg']

which is the equivalent to the XPath 2.0 function ends-with.
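As a rough illustration of the DOM and XPath approach, here is a minimal sketch. The sample markup and the variable names are made up for the example, and the start offset is string-length(.) - 3 so that exactly the last four characters are compared with '.jpg' (XPath positions are 1-based). PHP's DOMXPath only supports XPath 1.0, hence the substring workaround instead of ends-with.

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page source.
$html = '<p><img src="http://www.domain.com/image.png"> Yadayada '
      . '<img src="anotherpic.jpg"> '
      . '<img src="http://www.domain.com/photo.jpg"></p>';

// Collect parse warnings internally instead of printing them,
// since real-world HTML is often malformed.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// XPath 1.0 stand-in for ends-with(., '.jpg'):
// take the last four characters of the attribute value and compare them.
$nodes = $xpath->query("//@src[substring(., string-length(.) - 3) = '.jpg']");

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

With this sample input only the two src values ending in ".jpg" are collected; the ".png" source is skipped. If only absolute URLs are wanted, the predicate could presumably be extended with starts-with(., 'http://').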

The main benefit of using DOM and XPath is that it will only operate on src attributes, while your regex matches everywhere. There are plenty of usage examples for DOM and XPath here:

https://stackoverflow.com/search?q=xpath+OR+dom+php

Upvotes: 2

Ludovic Kuty

Reputation: 4954

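Replace the (.*?).jpg with ([^"]*)\.jpg so the match cannot cross the closing double quote of the src attribute. Escaping the dot also matters: an unescaped . matches any character, not just a literal dot. The pattern could be made even more generic with src="([^"]*)\.jpg", which drops the http requirement.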
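A small sketch of how the adjusted pattern might be used with preg_match_all. The sample string and variable names are invented for the example, and the capture group is moved to wrap the whole URL including the extension, which is a choice made here for convenience rather than part of the suggestion above.

```php
<?php
// Hypothetical sample markup modelled on the question's first example.
$html = '<img src="http://www.domain.com/image.png"> Yadayada '
      . '<img src="anotherpic.jpg"> '
      . '<img src="http://www.domain.com/photo.jpg">';

// Strict variant: absolute http URLs only. [^"]* cannot run past the
// closing quote of the attribute, and \. matches a literal dot.
$strictPattern = '/src="(http:\/\/[^"]*\.jpg)"/';
preg_match_all($strictPattern, $html, $strictMatches);
$strictUrls = $strictMatches[1];

// Generic variant: any src value ending in .jpg, relative or absolute.
$genericPattern = '/src="([^"]*\.jpg)"/';
preg_match_all($genericPattern, $html, $genericMatches);
$genericUrls = $genericMatches[1];

print_r($strictUrls);
print_r($genericUrls);
```

On this input the strict variant yields only http://www.domain.com/photo.jpg, while the generic variant also picks up anotherpic.jpg. Neither runs from the .png source into the following tag, which was the original problem.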

Upvotes: 3
