Mark

Reputation: 11

How can I remove external links from HTML using Perl?

I am trying to remove external links from an HTML document while keeping the internal anchors, but I'm not having much luck. The following regex

$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match from the beginning of an anchor tag through to the end of a later external link tag, e.g.

<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->

so I end up with nothing instead of

<a HREF="#FN1" name="01">1</a>
some other html

It just so happens that all anchors have their href attribute in uppercase, so I know I could do a case-sensitive match, but I don't want to rely on that always being the case in the future.

Is there something I can change so that it matches only the one <a> tag?

Upvotes: 0

Views: 3322

Answers (5)

Павел П

Reputation: 132

Even simpler, if you don't care about tag attributes:

$html =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;
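A quick check of this pattern against the sample from the question, as a minimal sketch (the variable name $stripped is only for illustration). Because [^>]+ accepts any attributes, the substitution appears to unwrap every anchor, the internal #FN1 one included, so it fits only when no anchor needs to survive:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input taken from the question: one internal anchor, one external link.
my $html = <<'END_HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
END_HTML

# Work on a copy so the original text stays available for comparison.
# <a[^>]+> matches any opening <a ...> tag regardless of its attributes.
(my $stripped = $html) =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;

print $stripped;
```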

Upvotes: 0

Leonardo Herrera

Reputation: 8406

Yet another solution. I love HTML::TreeBuilder and family.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

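my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $link ($root->find_by_tag_name('a')) {
    my $href = $link->attr('href');
    # Keep anchors that have no href at all, and internal (#fragment) links.
    next if !defined $href || $href =~ /^#/;
    # Replace the <a> element with its own content, dropping only the tag.
    $link->replace_with_content;
}
print $root->as_HTML(undef, "\t");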

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Upvotes: 1

Axeman

Reputation: 29854

A bit more like a SAX type parser is HTML::Parser:

use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;

my $omitted;

sub tag_handler { 
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) { 
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        # Guard against <a> tags without an href (e.g. named anchors).
        $omitted = defined $href && substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}

sub end_handler { 
    my $tag_name = shift;
    if ( $tag_name eq 'a' && $omitted ) { 
        $omitted = 0;    # Perl has no bareword "false"; it fails under strict
        return;
    }
    print shift;
}

my $parser
    = HTML::Parser->new( api_version => 3
                       , default_h   => [ sub { print shift; }, 'text' ]
                       , start_h     => [ \&tag_handler, 'self,tagname,text,attr' ]
                       , end_h       => [ \&end_handler, 'tagname,text' ]
                       );
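# Assumption: the path to the HTML file is given as the first argument.
my $path_to_file = shift @ARGV;
$parser->parse_file( $path_to_file ) or die $OS_ERROR;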

Upvotes: 7

Sinan Ünür

Reputation: 118128

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.

If you are going to take the s/// route, at least be honest and depend on the href attributes being all upper case, instead of putting up an illusion of flexibility.

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Output:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

NB: The regex-based solution you checked as "correct" breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else and you should rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.

Upvotes: 11

Amber

Reputation: 526593

Why not remove only the links whose href attribute doesn't begin with a pound sign? Something like this:

$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
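To see how the pattern behaves, here is a minimal sketch that runs it on the sample from the question. It assumes, as the pattern itself does, that href is the first attribute of the tag and is double-quoted; a link such as <a class="external" href="..."> would not match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input taken from the question: one internal anchor, one external link.
my $html = <<'END_HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
END_HTML

# [^#] rejects any href value that starts with '#', so the internal anchor
# never matches; [^"]*? cannot run past the closing quote of the attribute,
# so the match cannot spill from one tag into the next.
# Assumption: href is the first attribute and is double-quoted.
$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;

print $html;
```

The internal anchor on the first line should come through untouched, while the 155.htm link is reduced to its text.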

Upvotes: 0
