user1751343

Reputation: 149

Perl Search and replace in 400'000 files

I have about 400,000 files in which some text needs to be replaced.

I tried the following Perl script:

@files = <*.html>;

foreach $file (@files) {
    `perl -0777 -i -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' $file`;

    `perl -0777 -i -pe 's{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;' $file`;

    `perl -0777 -i -pe 's{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;' $file`;

    `perl -p -i -e 's/.css.html/.css/g;' $file`;
}

I don't have deep Perl knowledge, but the script runs too slowly (it updates only about 180 files per day).

Is there a way to speed it up?

Thank you in advance!

PS: When I tested it on a smaller number of files, I noticed much better performance.

Upvotes: 4

Views: 403

Answers (2)

TLP

Reputation: 67900

First off, loading 400,000 file names into an array at once is going to use a fair amount of memory. You can instead iterate through the files one at a time, for example with:

  • File::Find
  • opendir + while (readdir($dh)) (does not load the entire list)

Second, every backtick call spawns a shell and a new perl process (four times per file in your script), which is very inefficient. You could just open each file normally, slurp it, and then print the result back to the same file name. E.g.

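opendir my $dh, "." or die $!;
while (defined(my $file = readdir $dh)) {
    next unless $file =~ /\.html\z/ && -f $file;  # skip ".", "..", and non-HTML entries
    open my $fh, "<", $file or die "$file: $!";
    my $text = do { local $/; <$fh> };  # slurp file
    $text =~ s/....//g;                 # do your substitutions
    open $fh, ">", $file or die "$file: $!";
    print $fh $text;                    # overwrite file, much like the -i switch does
    close $fh or die "$file: $!";
}
closedir $dh;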
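
For reference, here is a fuller sketch of the same single-process idea with the four substitutions from the question filled in. The sub names clean_html, process_file, and process_dir are made up for illustration, and escaping the dots in the last pattern is an assumption about what was intended (an unescaped dot matches any character).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the four substitutions from the question to one document held in memory.
sub clean_html {
    my ($text) = @_;
    $text =~ s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
    $text =~ s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
    $text =~ s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
    $text =~ s{\.css\.html}{.css}g;    # assumption: literal dots were intended
    return $text;
}

# Slurp one file, transform it, and write it back only if something changed.
# Returns 1 if the file was rewritten, 0 otherwise.
sub process_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };
    close $in;
    $text = '' unless defined $text;
    my $new = clean_html($text);
    return 0 if $new eq $text;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $new;
    close $out or die "Cannot close $path: $!";
    return 1;
}

# Walk a directory one entry at a time instead of building a 400,000-element list.
sub process_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $changed = 0;
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.html\z/;
        my $path = "$dir/$name";
        next unless -f $path;
        $changed += process_file($path);
    }
    closedir $dh;
    return $changed;
}

# Run on the current directory when executed as a script.
printf "%d file(s) rewritten\n", process_dir('.') unless caller;
```

Skipping the write when nothing changed is a design choice of this sketch, not something the original script did; on a large tree where many files are already clean it avoids needless disk writes.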

Lastly, using regexes to edit HTML is not ideal. It might work for your case, but it could be worthwhile to invest some time in learning an HTML parser. I'm not sure how suitable it would be for this particular case, but it is worth looking into if you want to make your code more robust.

Upvotes: 7

choroba

Reputation: 241758

Calling perl from perl will always be slower than doing all the work in one process, so one solution might be to run all four substitutions in a single invocation:

perl -i -pe 'BEGIN { undef $/ }
             s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
             s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
             s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
             s/\.css\.html/.css/g;
    ' *.html
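
One possible caveat: with 400,000 files, the shell's expansion of *.html may exceed the system's argument-length limit and fail with "Argument list too long". A rough sketch of a workaround is to do the globbing inside Perl and set $^I, the variable behind the -i switch. The sub name edit_in_place is hypothetical, and the escaped dots in the last pattern are an assumption about the intended match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Edit the given files in place, reading each one whole.
# Returns the number of files processed.
sub edit_in_place {
    my @files = @_;
    return 0 unless @files;    # with an empty @ARGV, <> would read STDIN instead
    local $^I   = '';          # same effect as -i: in-place edit, no backup copy
    local @ARGV = @files;      # the files that <> will iterate over
    local $/;                  # slurp mode, so each read returns a whole file
    my $count = 0;
    while (defined(my $text = <>)) {
        $text =~ s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
        $text =~ s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
        $text =~ s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
        $text =~ s{\.css\.html}{.css}g;    # assumption: literal dots were intended
        print $text;           # while $^I is set, print goes to the file being edited
        $count++;
    }
    return $count;
}

# Perl's built-in glob does not go through the shell's argument list.
printf "%d file(s) processed\n", edit_in_place(glob '*.html') unless caller;
```

This still holds all file names in memory at once, which is usually tolerable for a few hundred thousand short names; reading the directory entry by entry avoids even that.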

Upvotes: 8
