Reputation: 45
Using Perl, how can I use a regex to take a string that contains arbitrary HTML with exactly one link and its anchor text, like this:
<a href="http://example.com" target="_blank">Whatever Example</a>
and keep ONLY that, getting rid of everything else? This should work no matter what else appears inside the <a> tag alongside the href attribute, such as title= or style=.
The result should still contain the anchor text ("Whatever Example") and the closing </a>.
Upvotes: 1
Views: 155
Reputation: 118128
You can take advantage of a stream parser such as HTML::TokeParser::Simple:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $html = <<EO_HTML;
Using Perl, how can I use a regex to take a string that has random HTML in it
with one HTML link with anchor, like this:
<a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>
and it leave ONLY that and get rid of everything else? No matter what
was inside the href attribute with the <a, like title=, or style=, or
whatever. and it leave the anchor: "Whatever Example" and the </a>?
EO_HTML
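# Parse the HTML from a string rather than from a file.
my $parser = HTML::TokeParser::Simple->new(string => $html);
# For each opening <a> tag, print the tag as written, then the text up to
# the next </a> (nested tags such as <i> are dropped), then a closing </a>.
while (my $tag = $parser->get_tag('a')) {
    print $tag->as_is, $parser->get_text('/a'), "</a>\n";
}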
Output:
$ ./whatever.pl
<a href="http://example.com" target="_blank">Whatever Interesting Example</a>
Upvotes: 2
Reputation: 977
If you need a simple regex solution, a naive approach might be:
my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;
However, as @dan1111 has mentioned, regular expressions are not the right tool for parsing HTML for various reasons.
If you need a reliable solution, look for an HTML parser module.
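To make that caveat concrete, here is a small sketch of one way the naive pattern can go wrong. The input string and the stricter `\b` variant are assumptions added for illustration, not part of the original answer. The stricter pattern fixes only this one failure and is still not a substitute for a parser.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: an <abbr> element precedes the only real link.
my $text = '<abbr title="HyperText">HTML</abbr> and '
         . '<a href="http://example.com" title="t">Whatever Example</a>';

# The naive pattern from above: "<a" also matches the start of "<abbr",
# so the capture begins at the <abbr> tag and runs to the first </a>.
my @naive = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

# A slightly stricter variant (an assumption, for illustration only):
# \b requires the tag name to end immediately after the "a".
my @strict = $text =~ m@(<a\b[^>]*>.*?</a>)@gsi;

print "naive:  $_\n" for @naive;
print "strict: $_\n" for @strict;
```

With this input, the naive pattern captures the entire string starting at `<abbr`, while the `\b` variant captures only the `<a ...>...</a>` element. Commented-out links, a `>` inside an attribute value, or nested markup can still trip up either pattern.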
Upvotes: 1