Clark Ventura

Reputation: 3

Perl: Replacing links (html) that meet certain criteria

On my forum, I want to automatically add rel="nofollow" to links that point to external sites. For instance, someone creates a post with the following text:

Link 1: <a href="http://www.external1.com" target="_blank">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>

Using Perl, I want that changed to:

Link 1: <a href="http://www.external1.com" target="_blank" rel="nofollow">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>

I can do this with quite a few lines of code, but I was hoping to do it with one or more regexes, and I can't figure out how.

Upvotes: 0

Views: 652

Answers (2)

amon

Reputation: 57600

Regexes can work in limited scenarios, but you should never use regexes to parse HTML:

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

    — from RegEx match open tags except XHTML self-contained tags

I am quite fond of the Mojo suite, because it allows us to use a proper parser with very little code. We can then use CSS selectors to find interesting elements:

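use strict; use warnings;
use autodie;          # open, rename etc. die on failure
use Mojo::DOM;        # the HTML parser from the Mojolicious distribution
use File::Slurp;      # read_file, write_file

for my $filename (@ARGV) {
  my $dom = Mojo::DOM->new(scalar read_file $filename);

  # Visit every <a> element that has an href attribute
  for my $link ($dom->find('a[href]')->each) {
    # Anything not on www.myforum.com counts as external
    $link->attr(rel => 'nofollow')
      if $link->attr('href') !~ m(\Ahttps?://www[.]myforum[.]com(?:/|\z));
  }

  # Write to a temp file first, then swap it into place
  write_file "$filename~", "$dom";
  rename "$filename~" => $filename;
}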

Invocation: perl mark-links-as-nofollow.pl *.html

A test run on your data produces the output:

Link 1: <a href="http://www.external1.com" rel="nofollow" target="_blank">External Link 1</a>
Link 2: <a href="http://www.myforum.com">Local Link 1</a>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</a>
Link 4: <a alt="Local" href="http://www.myforum.com/test">Local Link 2</a>

Why did I use tempfiles and rename? On most file systems, a file can be renamed atomically, whereas writing to a file takes some time. If the script wrote to the original file directly, other processes might see a half-written file.
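The same write-then-rename idea can be sketched with core modules only. This is a minimal sketch, not part of the script above: `atomic_write` is a hypothetical helper name, and it assumes the temp file should live in the target's directory so that both paths are on the same file system.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: replace $path with $content via a temp file.
sub atomic_write {
    my ($path, $content) = @_;

    # Create the temp file next to the target so rename() stays on one file system.
    my ($fh, $tmp) = tempfile(DIR => dirname($path));

    print {$fh} $content or die "write to $tmp failed: $!";
    close $fh            or die "close of $tmp failed: $!";

    # Readers see either the old file or the complete new one.
    rename $tmp, $path   or die "rename $tmp -> $path failed: $!";
    return;
}
```

One caveat with this variant: File::Temp creates its files with restrictive permissions, so a real script would probably want to restore the original file's mode after the rename.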

Upvotes: 1

ukautz

Reputation: 2213

I'd use a regex with the global and eval flags (/g and /e) so the replacement can call back into a function, e.g. like so:

#!/usr/bin/perl

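use strict;
use warnings;

# Matches hrefs on our own host; the look-ahead rejects look-alike hosts such as myforum.com.evil.example
my $internal_link = qr'href="https?:\/\/(?:www\.)?myforum\.com(?=[\/"])';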

my $html = '
Lorem ipsum
<a href="http://www.external1.com" target="_blank">External Link 1</A>
Lorem ipsum
<a href="http://www.myforum.com">Local Link 1</A>
Lorem ipsum
<a href="http://www.external2.com">External Link 2</A>
Lorem ipsum
<a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>
';

$html =~ s/<a ([^>]+)>/"<a ". replace_externals($1). ">"/eg;

print $html;

sub replace_externals {
    my ($inner) = @_;
    return $inner =~ $internal_link ? $inner : "$inner rel=\"nofollow\"";
}

Alternatively, you could use negative look-aheads, but that would hurt readability.
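For comparison, here is a rough sketch of what the look-ahead variant might look like. `add_nofollow` is a hypothetical name. The `(?=[/"])` boundary after the host and the check for an existing rel attribute are assumptions added for this sketch. It also assumes double-quoted href values, so a single-quoted internal link would be treated as external.

```perl
use strict;
use warnings;

# Hypothetical helper: add rel="nofollow" to <a> tags that do not point at myforum.com.
sub add_nofollow {
    my ($html) = @_;
    $html =~ s{
        <a\s+                 # start of an anchor tag
        (?!                   # reject the tag if it holds an internal href
            [^>]*\bhref="https?://(?:www\.)?myforum\.com(?=[/"])
        )
        (?![^>]*\brel\s*=)    # reject the tag if it already has a rel attribute
        ([^>]+)               # capture the attribute text
        >
    }{<a $1 rel="nofollow">}gxi;
    return $html;
}

print add_nofollow(qq{<a href="http://www.external2.com">External Link 2</A>\n});
print add_nofollow(qq{<a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>\n});
```

Even with /x and comments, the pattern is noticeably harder to follow than the callback version, which is the readability cost mentioned above.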

Upvotes: 0
