Reputation: 149
I have about 400,000 files in which some text needs to be replaced.
I tried the following Perl script:
@files = <*.html>;
foreach $file (@files) {
`perl -0777 -i -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' $file`;
`perl -0777 -i -pe 's{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;' $file`;
`perl -0777 -i -pe 's{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;' $file`;
`perl -p -i -e 's/.css.html/.css/g;' $file`;
}
I don't have deep Perl knowledge, but the script runs far too slowly (it updates only about 180 files per day).
Is there a way to speed it up?
Thank you in advance!
PS: When I tested it on a smaller number of files, I noticed much better performance...
Upvotes: 4
Views: 403
Reputation: 67900
First off, loading 400,000 file names into memory at once uses a fair amount of memory. You can instead iterate through the files one at a time, for example with:
File::Find
opendir + while (readdir($dh)) (does not load the entire list)
Second, using backticks spawns a new process through the shell for every command, which is very inefficient. You could just open the files normally, slurp them, and then print the result back to the same file name. E.g.
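opendir(my $dh, ".") or die "Cannot open directory: $!";
while (defined(my $file = readdir($dh))) {
    next unless $file =~ /\.html\z/ && -f $file;  # skip ".", "..", and non-HTML entries
    open my $fh, "<", $file or die "Cannot read $file: $!";
    local $/;                  # slurp mode, scoped to this iteration
    my $text = <$fh>;          # slurp file
    close $fh;
    $text =~ s/....//g;        # do your substitutions
    open $fh, ">", $file or die "Cannot write $file: $!";
    print $fh $text;           # overwrite file, much like the -i switch does
    close $fh or die "Cannot close $file: $!";
}
closedir($dh);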
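For the File::Find option mentioned above, a minimal sketch might look like the following. The sub names clean_html and rewrite_file are made up for illustration, the four substitutions are copied from the question (with the dots in the last one escaped), and the script assumes the directories to process are passed on the command line. Note that File::Find descends into subdirectories, which the original *.html glob did not.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: apply the question's substitutions to one document.
sub clean_html {
    my ($text) = @_;
    $text =~ s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsi;
    $text =~ s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsi;
    $text =~ s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsi;
    $text =~ s/\.css\.html/.css/g;
    return $text;
}

# Hypothetical helper: slurp one file, clean it, and write it back
# only if something changed. Returns 1 if the file was rewritten.
sub rewrite_file {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> } // '';   # slurp whole file
    close $in;

    my $new = clean_html($text);
    return 0 if $new eq $text;                 # nothing to do

    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $new;
    close $out or die "Cannot close $path: $!";
    return 1;
}

# find() changes into each directory and sets $_ to the entry's base name,
# so the file can be opened directly by that name.
find(sub { rewrite_file($_) if -f && /\.html\z/ }, @ARGV) if @ARGV;
```

Run it as, say, perl clean.pl /path/to/html (the script name is arbitrary). Everything happens in a single perl process, so no shell or child perl is started per file.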
Lastly, using regexes to edit HTML is not ideal. It might work for your case, but it could be worthwhile to invest some time in learning an HTML parser. I'm not sure how suitable it would be for this particular case, but it is worth looking into to make your code more robust.
Upvotes: 7
Reputation: 241758
Calling perl from perl will always be slower than doing all the work in one process. So one solution might be:
perl -i -pe 'BEGIN { undef $/ }
s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
s/\.css\.html/.css/g;
' *.html
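One caveat with roughly 400,000 files: the shell expands *.html before perl starts, and an argument list that large may be rejected by the system ("Argument list too long"). A possible workaround, sketched below, is to build the file list inside Perl and use the $^I variable, which is what the -i switch sets. The sub name edit_html_in_place is hypothetical, and the script assumes the directory is given as its first argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: edit every *.html file directly inside $dir in place.
# Returns the number of files processed.
sub edit_html_in_place {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    # Build the list in Perl, so the shell never has to expand *.html.
    local @ARGV = map { "$dir/$_" }
                  grep { /\.html\z/ && -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return 0 unless @ARGV;   # with an empty @ARGV, <> would read STDIN

    my $count = @ARGV;
    local $^I = '';          # in-place editing without backup files, like -i
    local $/;                # slurp whole files, like BEGIN { undef $/ }
    local $_;                # do not clobber the caller's $_
    while (<>) {
        s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsi;
        s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsi;
        s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsi;
        s/\.css\.html/.css/g;
        print;               # with $^I set, this goes to the file being edited
    }
    return $count;
}

edit_html_in_place(shift @ARGV) if @ARGV;
```

This does hold all the file names in memory at once, as the one-liner does; the point is only to take the shell out of the picture while keeping everything in one process.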
Upvotes: 8