Reputation: 3
On my forum, I want to automatically add rel="nofollow" to links that point to external sites. For instance, someone creates a post with the following text:
Link 1: <a href="http://www.external1.com" target="_blank">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>
Using Perl, I want that changed to:
Link 1: <a href="http://www.external1.com" target="_blank" rel="nofollow">External Link 1</A>
Link 2: <a href="http://www.myforum.com">Local Link 1</A>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</A>
Link 4: <a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>
I can do this using quite a few lines of code, but I was hoping I could do this with one or more regexes. But I can't figure out how.
Upvotes: 0
Views: 652
Reputation: 57600
Regexes can work in limited scenarios, but you should never use regexes to parse HTML:
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
— from RegEx match open tags except XHTML self-contained tags
I am quite fond of the Mojo suite, because it lets us use a proper parser with very little code. We can then use CSS selectors to find interesting elements:
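use strict; use warnings;
use autodie;
use Mojo::DOM;  # load the DOM parser itself
use File::Slurp;

for my $filename (@ARGV) {
    my $dom = Mojo::DOM->new(scalar read_file $filename);
    # Visit every anchor that has an href attribute.
    for my $link ($dom->find('a[href]')->each) {
        # Anything that does not point at the forum's own host is external.
        $link->attr(rel => 'nofollow')
            if $link->attr('href') !~ m(\Ahttps?://www[.]myforum[.]com(?:/|\z));
    }
    # Write to a temp file first, then rename it over the original.
    write_file "$filename~", "$dom";
    rename "$filename~" => $filename;
}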
Invocation: perl mark-links-as-nofollow.pl *.html
A test run on your data produces the output:
Link 1: <a href="http://www.external1.com" rel="nofollow" target="_blank">External Link 1</a>
Link 2: <a href="http://www.myforum.com">Local Link 1</a>
Link 3: <a href="http://www.external2.com" rel="nofollow">External Link 2</a>
Link 4: <a alt="Local" href="http://www.myforum.com/test">Local Link 2</a>
Why did I use tempfiles and rename? On most file systems, a file can be renamed atomically, whereas writing to a file takes some time, so other processes might see a half-written file.
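The write-then-rename idea can also be sketched with core modules only. This is a minimal sketch, not the code from the answer: write_atomically is a hypothetical helper name, and it assumes the temp file may live in the same directory as the target, so that rename never has to cross file systems.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# write_atomically is a hypothetical helper name used only for this sketch.
sub write_atomically {
    my ($filename, $content) = @_;

    # Create the temp file next to the target so rename() stays on one
    # file system; a rename across file systems would fail.
    my ($fh, $tmpname) = tempfile(DIR => dirname($filename));

    print {$fh} $content       or die "Cannot write $tmpname: $!";
    close $fh                  or die "Cannot close $tmpname: $!";

    # Readers see either the old file or the complete new one.
    rename $tmpname, $filename or die "Cannot rename $tmpname: $!";
    return;
}
```

Note that File::Temp creates its files with restrictive permissions, so a real script would probably want to chmod the temp file to match the original before renaming it.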
Upvotes: 1
Reputation: 2213
I'd use a regex with the global and eval flags (/g and /e) for a callback, e.g. like so:
#!/usr/bin/perl
use strict;
my $internal_link = qr'href="https?:\/\/(?:www\.)?myforum\.com';
my $html = '
Lorem ipsum
<a href="http://www.external1.com" target="_blank">External Link 1</A>
Lorem ipsum
<a href="http://www.myforum.com">Local Link 1</A>
Lorem ipsum
<a href="http://www.external2.com">External Link 2</A>
Lorem ipsum
<a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>
';
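# /e evaluates the replacement as Perl code; /g applies it to every <a ...> tag.
$html =~ s/<a ([^>]+)>/"<a ". replace_externals($1). ">"/eg;
print $html;

# Return the attribute list unchanged for internal links;
# append rel="nofollow" to everything else.
sub replace_externals {
    my ($inner) = @_;
    return $inner =~ $internal_link ? $inner : "$inner rel=\"nofollow\"";
}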
Alternatively, you could use a negative look-ahead, but that would hurt readability.
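For comparison, here is roughly what the look-ahead variant might look like. It is only a sketch: add_nofollow is a hypothetical helper name, the host pattern is the one from the code above, and it assumes double-quoted href values and tags without an existing rel attribute, so it shares the usual limits of regex-based HTML handling.

```perl
use strict;
use warnings;

# add_nofollow is a hypothetical helper name used only for this sketch.
sub add_nofollow {
    my ($html) = @_;
    $html =~ s{
        <a \s+                # start of an opening anchor tag
        (?!                   # fail if the attributes contain ...
            [^>]*?
            href="https?://(?:www\.)?myforum\.com
            (?: / | " )       # ... an href on the forum host itself
        )
        ( [^>]+ )             # otherwise capture the attribute list
        >
    }{<a $1 rel="nofollow">}gix;
    return $html;
}

print add_nofollow(qq{<a href="http://www.external2.com">External Link 2</A>\n});
print add_nofollow(qq{<a href="http://www.myforum.com/test" ALT="Local">Local Link 2</A>\n});
```

The first line comes back with rel="nofollow" appended to the tag, while the second is left untouched. The /x flag keeps the pattern legible, but the callback version is arguably still easier to change later.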
Upvotes: 0