moey

Reputation: 10887

Global Regex Substitution with Unique Arbitrary Values

I have one huge HTML file with many links, i.e. <a href="...">. I need to substitute each href with a unique arbitrary value, so that after substitution the first link is <a href="http://link1">, the second is <a href="http://link2">, and so on.

Can we do this using a regex? Or, do I need to write a small script to scan over the file? Ideally, the solution will be a Perl or bash script (not something proprietary).

Thanks.

Upvotes: 0

Views: 138

Answers (3)

Chriszuma

Reputation: 4557

Perl is probably your best bet, but I wouldn't try to do it in one regex (might not even be possible). I think this is as short as you can make the script while still making it readable:

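#!/usr/bin/perl
use strict;
use warnings;

# Replace every href value with a numbered placeholder: link1, link2, ...
my $link = 1;
while (<>) {
    # Each successful substitution rewrites one href, then the counter is bumped.
    # (?!link\d) skips hrefs that have already been rewritten.
    $link++ while s/href="(?!link\d)[^"]*"/href="link$link"/;
    print;
}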

Then call it like so:

./thatScript.pl inputFile.html > newInputFile.html

It examines each line of input and replaces each href="..." it finds with a numbered link, incrementing the link number after every replacement. The negative lookahead (?!link\d) keeps it from matching an href that has already been replaced, which would otherwise make the loop run forever on the first link.

EDIT: Just for the hell of it, here's how you would compress the above into a single line of bash:

perl -pe 'BEGIN { $link = 1 } $link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ )' inFile.html > outFile.html

This makes use of Perl's amazing -p flag, which wraps the code in an implicit while (<>) { ... } loop and prints each line after the code has run on it.

Upvotes: 2

glenn jackman

Reputation: 246817

Untested:

perl -pe 's{(href=")[^"]+}{$1 . "http://link" . ++$count}ge' filename > newfile

Upvotes: 0

Ashley

Reputation: 4335

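I definitely don't recommend this (tchrist is right, of course: it should be a script), but it does have the virtue of being terse and of fulfilling the literal requirements in a deterministic, repeatable way without needing to save any state or mapping. It parses the HTML and replaces each href with the MD5 hex digest of its original value, so the same URL always maps to the same replacement.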

perl -MDigest::MD5=md5_hex -MXML::LibXML -le '$d = XML::LibXML->load_html( location => shift || die "need location" ); for $a ( $d->findnodes("//\@href") ) { $a->setValue( md5_hex $a->value ) }; print $d->serialize' targeted.html

Upvotes: 1
