Reputation: 63687
I am using preg_match with the pattern
$pattern = '/src="http:\/\/(.*?).jpg"/s';
to grab URLs of JPEG images off a webpage. However, this is not accurate enough, as it also grabs
http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg
Other times, it grabs stuff like
http://maps.google.com/maps/api/staticmap?center=42.34,-71.18&path=weight:4|42.338,-71.177|42.338,-71.183|42.342,-71.183|42.342,-71.177|42.338,-71.177&zoom=15&size=335x225&sensor=false" width="280" height="188" alt=""></td></tr> <tr><td height="10"></td></tr></table></td></tr></table></td></tr><tr><td height="10 valign="> </td></tr><tr><td valign="top" background="http://www.coolapartments.info/img/java-footer_bg.jpg
How can I improve the pattern to prevent unwanted matches like the two examples above?
Upvotes: 0
Views: 755
Reputation: 317119
Use DOM and this XPath
//@src[contains(., '.jpg')]
to match all src attributes whose value contains the string ".jpg" somewhere.
If the attribute should end in ".jpg" use
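//@src[substring(., string-length(.) - 3) = '.jpg']
which is the XPath 1.0 equivalent of the XPath 2.0 function ends-with. XPath string positions start at 1, so the last four characters begin at string-length(.) - 3.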
The main benefit of using DOM and XPath is that it will only operate on src attributes, while your regex matches anywhere in the markup. There are plenty of usage examples for DOM and XPath.
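A minimal sketch of how the XPath above could be run with PHP's DOM extension. The sample markup is made up for illustration (loosely based on the strings in the question), and the variable names are my own:

```php
<?php
// Hypothetical sample markup; not the asker's real page.
$html = <<<'HTML'
<html><body>
<img src="http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg">
<table><tr><td background="http://www.coolapartments.info/img/java-footer_bg.jpg">
<img src="http://www.domain.com/photo.jpg" alt="">
</td></tr></table>
</body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select every src attribute whose value ends in ".jpg" (XPath 1.0 ends-with idiom).
$jpgUrls = [];
foreach ($xpath->query("//@src[substring(., string-length(.) - 3) = '.jpg']") as $attr) {
    $jpgUrls[] = $attr->nodeValue;
}

print_r($jpgUrls);
```

Because the query only looks at src attributes, the .png image and the background attribute are never considered, even though the latter ends in ".jpg". If only absolute URLs are wanted, adding a starts-with(., 'http://') condition to the predicate should narrow it down.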
Upvotes: 2
Reputation: 4954
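Replace
(.*?).jpg
with
([^"]*)\.jpg
to keep the match from crossing the closing double quote of the src attribute. Escaping the dot also makes it match a literal period rather than any character. The pattern could be even more generic, without requiring the http:// prefix:
src="([^"]*)\.jpg"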
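A quick sketch of both variants with preg_match_all, run against a made-up snippet of markup (the sample string and variable names are assumptions for illustration). Note that here the capture group is moved to wrap the whole URL including the extension, which is a small deviation from the patterns above:

```php
<?php
// Hypothetical sample markup; not the asker's real page.
$html = '<img src="http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg">'
      . '<td background="http://www.coolapartments.info/img/java-footer_bg.jpg">'
      . '<img src="http://www.domain.com/photo.jpg">';

// Variant 1: keep the http:// prefix, but never cross a double quote.
preg_match_all('/src="(http:\/\/[^"]*\.jpg)"/', $html, $matches);
$httpOnly = $matches[1];

// Variant 2: more generic, any src value ending in .jpg.
preg_match_all('/src="([^"]*\.jpg)"/', $html, $matches);
$anySrc = $matches[1];

print_r($httpOnly);
print_r($anySrc);
```

Since [^"]* cannot consume a double quote, a match can no longer start at the .png image and run on into a later attribute, which is what happened with (.*?) in the original pattern.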
Upvotes: 3