Reputation: 35077
I am reading around more than million lines of million files.
I have tried to replace certain strings in them using regular expressions.
My strings are ("tiger", "lion", "monkey"), and I am replacing each of them with the string "animal".
I have achieved this using regex substitution:
$line =~ s/tiger/animal/g;
$line =~ s/lion/animal/g;
$line =~ s/monkey/animal/g;
Processing takes a long time.
I want to understand why this is slow and how I can solve this problem in a faster way.
I can't use any external modules to resolve this issue.
Upvotes: 0
Views: 162
Reputation: 126722
I'm not clear what "around more than million lines of million files" means, but suppose you have a million files, each with a million lines of, say, 40 characters each. That comes to 40TB of information.
If the data is on a hard disk, reading at, say, 50MB/s, this amount of data will take 40E12/50E6 = 800,000 seconds to read, or just over nine days.
If your program is completing in a few hours then you should be very grateful!
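For reference, the same back-of-the-envelope estimate as a small script. The figures (a million files, a million lines each, 40 bytes per line, 50MB/s) are the guesses from above, not measurements, and the variable names are only illustrative:

```perl
use strict;
use warnings;

# Assumed figures from the estimate above, not measured values.
my $files              = 1e6;    # number of files
my $lines_per_file     = 1e6;    # lines in each file
my $bytes_per_line     = 40;     # average line length
my $disk_bytes_per_sec = 50e6;   # sequential read speed, 50MB/s

my $total_bytes = $files * $lines_per_file * $bytes_per_line;
my $seconds     = $total_bytes / $disk_bytes_per_sec;
my $days        = $seconds / 86_400;   # 86,400 seconds in a day

printf "%.0f TB, %.0f seconds, %.2f days\n",
    $total_bytes / 1e12, $seconds, $days;
```

Plug in your real file count, line length and disk speed to see whether the I/O alone already accounts for the run time.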
Upvotes: 0
Reputation: 121712
Use the "precompiled form" of regexes:
my $regex = qr/\b(?:tiger|lion|monkey)\b/;
# in your loop:
$line =~ s/$regex/animal/g;
Note: the three regexes have been reduced to a single one, and a non-capturing group (?:...) is used since there is no use for the captured text. Also, word anchors have been added (this means that monkey will be matched but not greasemonkey, for instance). Add s? before the last \b if you also want to replace plurals.
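A minimal sketch of the plural-aware variant, assuming the same three words. The helper name replace_animals and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Compile the pattern once, outside the loop.
# s? before the final \b also matches the plural forms.
my $regex = qr/\b(?:tiger|lion|monkey)s?\b/;

# Hypothetical helper: returns a copy of the line with matches replaced.
sub replace_animals {
    my ($line) = @_;
    $line =~ s/$regex/animal/g;
    return $line;
}

# Made-up sample input.
my @lines = (
    "the tiger chased two monkeys",
    "greasemonkey scripts and a lion",
);

print replace_animals($_), "\n" for @lines;
```

Because of the leading \b, greasemonkey is left alone here. Note that a plural such as monkeys is replaced by the singular animal; adjust the replacement if that matters.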
This, however, takes care of only the regex part: you also talk about other kinds of processing, so maybe the entire process can be altered in some way to make it faster overall.
Upvotes: 5
Reputation: 67211
You can also do this in a single substitution instead of three:
$line=~s/(tiger|monkey|lion)/animal/g;
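If you want to check whether one substitution really beats three on your data, here is a rough sketch using the core Benchmark module (it ships with Perl, so it is not an external dependency). The sample line and the sub names three_passes and one_pass are made up; which variant wins may depend on your Perl version and input, so measure rather than assume:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);   # core module, ships with Perl

# Made-up sample data; substitute a slice of your real input.
my @sample = ("a tiger, a lion and a monkey walk past a zebra\n") x 10_000;

# Three separate substitutions, as in the question.
sub three_passes {
    my ($line) = @_;
    $line =~ s/tiger/animal/g;
    $line =~ s/lion/animal/g;
    $line =~ s/monkey/animal/g;
    return $line;
}

# One substitution with an alternation, as in this answer.
sub one_pass {
    my ($line) = @_;
    $line =~ s/(tiger|monkey|lion)/animal/g;
    return $line;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    three_passes => sub { three_passes($_) for @sample },
    one_pass     => sub { one_pass($_)     for @sample },
});
```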
Upvotes: 0