joe

Reputation: 35077

Perl file processing optimization for regex

I am reading around more than million lines of million files.

I am trying to replace certain strings in them using regular expressions.

The strings are "tiger", "lion" and "monkey", and I am replacing each of them with the string "animal".

I have achieved this using regex substitutions:

$line =~ s/tiger/animal/g;
$line =~ s/lion/animal/g;
$line =~ s/monkey/animal/g;

When I run this, processing takes a very long time.

I want to understand why this is slow and how I can make it faster.

I can't use any external modules to resolve this issue.

Upvotes: 0

Views: 162

Answers (3)

Borodin

Reputation: 126722

I'm not clear what "around more than million lines of million files" means, but suppose you have a million files, each with a million lines of, say, 40 characters each. That comes to 40TB of information.

If the data is on a hard disk, reading at, say, 50MB/s, this amount of data will take 40E12/50E6 = 800,000 seconds to read, or just over nine days.

If your program is completing in a few hours then you should be very grateful!
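
The estimate above can be reproduced with a few lines of Perl. This is only a sketch of the arithmetic: the file count, line count, line length and disk throughput are the assumed figures from this answer, not measurements, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Assumed figures from the estimate above (not measured values).
my $files            = 1_000_000;   # number of files
my $lines_per_file   = 1_000_000;   # lines in each file
my $bytes_per_line   = 40;          # average line length
my $bytes_per_second = 50e6;        # sequential read speed, 50MB/s

my $total_bytes = $files * $lines_per_file * $bytes_per_line;
my $seconds     = $total_bytes / $bytes_per_second;
my $days        = $seconds / 86_400;   # 86,400 seconds in a day

printf "%.0f TB, %.0f seconds, %.2f days\n",
    $total_bytes / 1e12, $seconds, $days;
```

Changing the assumed figures shows how quickly the read time alone dominates, whatever the regex does.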

Upvotes: 0

fge

Reputation: 121712

Use the "precompiled form" of regexes:

my $regex = qr/\b(?:tiger|lion|monkey)\b/;

# in your loop:
$line =~ s/$regex/animal/g;

Note: the three regexes have been reduced to a single one, and a non-capturing group (?:...) is used since there is no use for the captured text. Also, word anchors have been added (this means that monkey will be matched but not greasemonkey, for instance). Add s? before the last \b if you also want to replace plurals.

This, however, only takes care of the regex part. You also talk about other kinds of processing, so maybe the entire process can be altered in some way to make it faster overall.
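
To show the precompiled regex in the context of a read loop, here is a minimal sketch. The helper name replace_animals and the in-memory sample text are assumptions made for illustration; in real use the filehandle would be opened on one of your files instead.

```perl
use strict;
use warnings;

# Compile the alternation once, outside the loop.
my $regex = qr/\b(?:tiger|lion|monkey)\b/;

# Hypothetical helper: replaces every whole-word match in one line.
sub replace_animals {
    my ($line) = @_;
    $line =~ s/$regex/animal/g;
    return $line;
}

# Sample input held in memory so the sketch is self-contained.
my $sample = "the tiger chased a monkey\ngreasemonkey scripts\n";
open my $in, '<', \$sample or die "cannot open in-memory sample: $!";

my @out;
while (my $line = <$in>) {
    push @out, replace_animals($line);
}
close $in;

print @out;
```

Because of the word anchors, the second sample line is left unchanged, which differs from the behaviour of the three substitutions in the question.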

Upvotes: 5

Vijay

Reputation: 67211

You can also do this with a single substitution instead of three:

$line=~s/(tiger|monkey|lion)/animal/g;
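
Whether one pass is actually faster on your data is best measured rather than assumed. The following is a rough sketch using the core Benchmark module (it ships with Perl, so it is not an external dependency). The sample line and the sub names three_passes and one_pass are made up for illustration, and a non-capturing group is used here because the captured text is not needed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);   # core module, ships with Perl

# Made-up sample line; real timings depend on your actual data.
my $sample = "a tiger, a lion and a monkey walk past a tiger\n";

# Three passes over the line, as in the question.
sub three_passes {
    my ($line) = @_;
    $line =~ s/tiger/animal/g;
    $line =~ s/lion/animal/g;
    $line =~ s/monkey/animal/g;
    return $line;
}

# One pass with an alternation; (?:...) avoids an unneeded capture.
sub one_pass {
    my ($line) = @_;
    $line =~ s/(?:tiger|monkey|lion)/animal/g;
    return $line;
}

# Check that both variants agree on this sample before timing them.
my $same = three_passes($sample) eq one_pass($sample);
print "same output on sample: ", ($same ? "yes" : "no"), "\n";

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    three_passes => sub { three_passes($sample) },
    one_pass     => sub { one_pass($sample) },
});
```

The two variants agree on this sample, but they are not guaranteed to agree on every input, since a later pass in the three-step version can match text produced by an earlier replacement.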

Upvotes: 0
