Reputation: 64199
My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is to have my Perl script automatically extract the string inside the HREF="" attribute of the <A> tag.
As I am very new to regex, I want to ask whether my regex is badly formed. If so, can someone suggest how to improve it?
Secondly, I do not know why my second output is "><HR>". I expected the second line to be skipped, since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
Upvotes: 2
Views: 5833
Reputation: 29854
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath.
It gives you the power of XPath on HTML that is not well-formed.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
Upvotes: 0
Reputation: 1853
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning that by default it matches as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be a bit more exacting about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
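A minimal sketch of the three forms side by side. The sample line is hypothetical: it adds a TITLE attribute to the question's link so that the greedy match visibly overshoots, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input: the question's link with an extra quoted attribute.
my $line = '<A HREF="yahoo.com.jp/" TITLE="Yahoo">yahoo.com.jp/</A>';

# Greedy: .* extends to the LAST double quote on the line.
my ($greedy) = $line =~ /HREF="(.*)"/;

# Non-greedy: .*? stops at the FIRST double quote that lets the match succeed.
my ($lazy) = $line =~ /HREF="(.*?)"/;

# Negated character class: [^"]* cannot cross a double quote at all.
my ($class) = $line =~ /HREF="([^"]*)"/;

print "greedy: $greedy\n";   # yahoo.com.jp/" TITLE="Yahoo
print "lazy:   $lazy\n";     # yahoo.com.jp/
print "class:  $class\n";    # yahoo.com.jp/
```

The negated class is often the easier one to reason about, because it states exactly which characters may appear inside the attribute value.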
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Upvotes: 8
Reputation: 5488
When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want .* because star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote. This will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.
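As a rough sketch of how that pattern might be applied to the two lines from the question: the @lines array stands in for reading the text file and is an assumption, as is the @hrefs name. The match is tested in an if condition so that the capture variable is read only when the line actually matched, which means a line without HREF=" is skipped.

```perl
use strict;
use warnings;

# Stand-in for the two lines of the question's text file.
my @lines = (
    '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>',
    '</PRE><HR>',
);

my @hrefs;
for my $line (@lines) {
    # $1 is only meaningful after a successful match, so check the result first.
    if ($line =~ /HREF="([^"]*)"[^>]*>/i) {
        push @hrefs, $1;
    }
}

print "$_\n" for @hrefs;   # prints yahoo.com.jp/
```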
Upvotes: -1
Reputation: 30831
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input, but you're better off using something like HTML::Parser instead.
Upvotes: 8