Reputation: 10887
I have one huge HTML file with many links, i.e. <a href="...">. I need to substitute each href with a unique arbitrary value, so that after substitution the first link will be <a href="http://link1">, the second <a href="http://link2">, and so on.
Can we do this using a regex? Or, do I need to write a small script to scan over the file? Ideally, the solution will be a Perl or bash script (not something proprietary).
Thanks.
Upvotes: 0
Views: 138
Reputation: 4557
Perl is probably your best bet, but I wouldn't try to do it in one regex (might not even be possible). I think this is as short as you can make the script while still making it readable:
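#!/usr/bin/perl
use strict;
use warnings;
my $link = 1;
while (<>) {
    # Replace one not-yet-renumbered href per pass; the (?!link\d) lookahead
    # skips hrefs that were already rewritten on this line.
    $link++ while s/href="(?!link\d)[^"]*"/href="link$link"/;
    print;
}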
Then call it like so:
./thatScript.pl inputFile.html > newInputFile.html
It examines each line of input and, for each href="..." it finds, replaces it with a numbered link and increments the link number. The negative lookahead (?!link\d) keeps the loop from replacing the same href over and over.
EDIT: Just for the hell of it, here's how you would compress the above into a single line of bash:
perl -pe '$link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ )' inFile.html > outFile.html
This makes use of Perl's amazing -p flag, which wraps the code in an implicit while (<>) { ... } loop and prints each line after it has been processed.
Upvotes: 2
Reputation: 246817
untested:
perl -pe 's{(href=")[^"]+}{$1 . "http://link" . ++$count}ge' filename > newfile
Upvotes: 0
Reputation: 4335
I definitely don't recommend this (tchrist is right, of course, it should be a script) but it does have the virtue of being terse and fulfilling the literal requirements in a deterministic/repeatable way without needing to save state/mapping.
perl -MDigest::MD5=md5_hex -MXML::LibXML -le '$d = XML::LibXML->load_html( location => shift || die "need location" ); for $a ( $d->findnodes("//\@href") ) { $a->setValue( md5_hex $a->value ) }; print $d->serialize' targeted.html
Upvotes: 1