Reputation: 258
I use a regex to extract <img src="img.jpg"> tags.
Here is my regex:
my @accept = $message_body =~ /<img src=\"\S*\">/gi;
Now my regex fails when the img tag looks like this: <img src="cid:img.jpg">
Can anyone tell me why?
Upvotes: 2
Views: 1139
Reputation: 14919
In case you missed n0rd's comment, here is the essential link again about the use of regular expressions with (X|HT)ML.
With that out of the way, here is one way to do it with a module (of course, in the spirit of TIMTOWTDI, more than one module would be suitable):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $file = shift or die "Missing argument! Usage: $0 FILENAME\n";

# parse_file opens and reads the file itself, so no separate open() is needed.
my $t = HTML::TreeBuilder::XPath->new();
$t->parse_file($file)
    or die "Could not parse $file\n";

# Visit every <img> element, wherever it appears in the document.
foreach my $img ( $t->findnodes('//img') ) {
    print $img->as_HTML, "\n";

    # Print only the attributes that are actually present on this tag.
    foreach my $attr (qw(src width height alt title)) {
        my $value = $img->attr($attr);
        print "$attr = $value\n" if defined $value;
    }
    print "\n";
}
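The question holds its HTML in a string ($message_body) rather than in a file, so a variant that parses a string may be closer to what is needed. The following is only a sketch under assumptions: the sample content of $message_body is made up for illustration, and it relies on new_from_content, the constructor that HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical message body; in the question this comes from a mail message.
my $message_body =
    '<p>Hi</p><img src="cid:img.jpg" width="10"><img alt="no source here">';

# Build the tree directly from the string instead of from a file.
my $tree = HTML::TreeBuilder::XPath->new_from_content($message_body);

# Collect the src value of every <img> that actually has one.
my @sources = grep { defined } map { $_->attr('src') } $tree->findnodes('//img');

print "$_\n" for @sources;

# Free the parse tree once we are done with it.
$tree->delete;
```

Because the parser works on attributes rather than on raw text, a src value such as cid:img.jpg needs no special handling, and an img tag without a src is simply skipped.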
Upvotes: 0
Reputation: 139531
The * quantifier is greedy: it matches as much as it can while allowing the rest of the pattern to match. In your case, \S* is likely consuming more text than you intended.
Consider using
my @accept = $message_body =~ /<img src="\S*?">/gi;
or
my @accept = $message_body =~ /<img src="[^"]+">/gi;
These patterns attempt to stop matching as soon as they detect a closing double-quote, but they are heuristics that could fail depending on how friendly your input is. To do the job properly, use an HTML parser.
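To make the difference concrete, here is a small sketch that runs the original pattern and the two suggested ones against the same string. The input is a contrived assumption, not taken from the question: it is built so that no whitespace follows the tag and a second "> appears later in the same run of non-space characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Contrived input (an assumption for illustration): no whitespace follows the
# tag, and a later "> appears in the same run of non-space text.
my $message_body = '<img src="cid:img.jpg">"caption">';

# Greedy: \S* grabs the whole run, then backs off only as far as the LAST ">.
my @greedy = $message_body =~ /<img src="\S*">/gi;

# Non-greedy: \S*? grows one character at a time and stops at the FIRST ">.
my @lazy = $message_body =~ /<img src="\S*?">/gi;

# Negated class: [^"]+ can never cross a double quote at all.
my @negated = $message_body =~ /<img src="[^"]+">/gi;

print "greedy:  $greedy[0]\n";
print "lazy:    $lazy[0]\n";
print "negated: $negated[0]\n";
```

Here the greedy pattern returns the whole string, trailing "caption"> included, while the other two return only the tag. On a lone <img src="cid:img.jpg"> all three patterns match the same text, so the overreach only shows up when more quoted text follows without intervening whitespace.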
Upvotes: 3
Reputation: 15000
The greediness of \"\S*\" means that it will match as many non-space characters as possible, up to the last " in that run of non-space characters. You could change this to \".*?\" which will match any characters up to the next ".
I would completely overhaul your expression so that it avoids some other difficult HTML edge cases.
This expression will cope with a > or something that looks like an attribute inside an embedded JavaScript function, capture the value of the src attribute, and ignore attributes that only look like src, such as hrefsrc="somevalue".
The (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]) construct allows multiple attributes to appear in any order inside the img tag.
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>
Live Example: http://www.rubular.com/r/bRmdy0YA0S
Sample Text
Note how the second image tag has some of the really difficult edge cases.
<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
Matches
[0][0] = <img src="cid:img.jpg">
[0][1] = cid:img.jpg
[1][0] = <img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
[1][1] = cid:DifficultToFind.jpg
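For completeness, this is roughly how the expression could be used from Perl. It is a sketch only: the pattern is copied unchanged from above, the two sample tags are the ones shown in the sample text, and the variable names ($img_re, $sample, @sources) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from above, copied unchanged; /i makes it case-insensitive.
my $img_re = qr{<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>}i;

# The two sample tags, in a non-interpolating heredoc so nothing is expanded.
my $sample = <<'END_HTML';
<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
END_HTML

# With /g in list context, a pattern with one capture group returns the
# captured text of every match, i.e. the src values.
my @sources = $sample =~ /$img_re/g;

print "$_\n" for @sources;
```

Run against the sample text, this should print the same two src values listed under Matches.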
Upvotes: 4