Nyxynyx

Reputation: 63687

Inaccurate preg_match with '.jpg' pattern

I am using preg_match with the pattern $pattern = '/src="http:\/\/(.*?).jpg"/s'; to grab URLs of JPEG images off a webpage. However, this is not accurate enough, as it also grabs http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg.

Other times, it grabs stuff like

http://maps.google.com/maps/api/staticmap?center=42.34,-71.18&amp;path=weight:4|42.338,-71.177|42.338,-71.183|42.342,-71.183|42.342,-71.177|42.338,-71.177&amp;zoom=15&amp;size=335x225&amp;sensor=false" width="280" height="188" alt=""></td></tr> <tr><td height="10"></td></tr></table></td></tr></table></td></tr><tr><td height="10 valign="> </td></tr><tr><td valign="top" background="http://www.coolapartments.info/img/java-footer_bg.jpg

How can I improve the pattern to prevent unwanted matching like the 2 examples above?

Upvotes: 0

Views: 755

Answers (2)

Gordon

Reputation: 317119

Use DOM and this XPath

//@src[contains(., '.jpg')]

to match all src attributes of elements that contain the string ".jpg" somewhere.

If the attribute should end in ".jpg" use

//@src[substring(., string-length(.) - 3) = '.jpg']

which is equivalent to the XPath 2.0 function ends-with (PHP's DOM extension only supports XPath 1.0, so ends-with itself is not available).

The main benefit of using DOM and XPath is that it will only operate on src attributes, while your regex matches everywhere. There are plenty of usage examples for DOM and XPath available.
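As a rough illustration of the DOM and XPath approach, here is a minimal sketch. The sample markup and the example.com URLs are made up for the demonstration; in practice $html would hold the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<'HTML'
<table><tr>
  <td background="http://www.example.com/img/footer_bg.jpg">
    <img src="http://www.example.com/image.png"> Yadayada <img src="anotherpic.jpg">
    <img src="http://www.example.com/photo.jpg" alt="">
  </td>
</tr></table>
HTML;

$dom = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select only src attributes whose last four characters are ".jpg".
$nodes = $xpath->query("//@src[substring(., string-length(.) - 3) = '.jpg']");

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

The background attribute and the .png image are skipped because only src attributes ending in ".jpg" are selected. If only absolute URLs are wanted, adding "and starts-with(., 'http://')" inside the predicate should narrow the result further.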

Upvotes: 2

Ludovic Kuty

Reputation: 4954

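Replace the (.*?).jpg with ([^"]*)\.jpg so the match cannot cross the closing double quote of the src attribute. Note that the dot is escaped as well; an unescaped dot matches any character. The pattern could be made more generic with src="([^"]*)\.jpg", which does not require the URL to start with http.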
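A minimal sketch of the adjusted pattern, assuming the http:// prefix should still be required and the full URL captured. The sample markup is hypothetical and mimics the over-matching case from the question.

```php
<?php
// Hypothetical sample markup reproducing the over-matching case from the question.
$html = '<img src="http://www.domain.com/image.png"> Yadayada '
      . '<img src="http://www.domain.com/anotherpic.jpg">';

// [^"]* cannot cross a double quote, and \. matches a literal dot.
// Using # as the delimiter avoids having to escape the slashes in http://.
$pattern = '#src="(http://[^"]*\.jpg)"#';

// preg_match_all collects every match; preg_match would stop at the first one.
preg_match_all($pattern, $html, $matches);

print_r($matches[1]);
```

With this input, the .png image no longer starts a match, so only the URL of the second image is captured.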

Upvotes: 3
