Reputation: 864
I'm using Perl to perform some file cleansing, and am running into some performance issues. One of the major parts of my code involves standardizing name fields. I have several sections that look like this:
sub substitute_titles
{
my ($inStr) = @_;
${$inStr} =~ s/ PHD./ PHD /;
${$inStr} =~ s/ P H D / PHD /;
${$inStr} =~ s/ PROF./ PROF /;
${$inStr} =~ s/ P R O F / PROF /;
${$inStr} =~ s/ DR./ DR /;
${$inStr} =~ s/ D.R./ DR /;
${$inStr} =~ s/ HON./ HON /;
${$inStr} =~ s/ H O N / HON /;
${$inStr} =~ s/ MR./ MR /;
${$inStr} =~ s/ MRS./ MRS /;
${$inStr} =~ s/ M R S / MRS /;
${$inStr} =~ s/ MS./ MS /;
${$inStr} =~ s/ MISS./ MISS /;
}
I'm passing by reference to try and get at least a little speed, but I fear that running so many (literally hundreds) of specific string replaces on tens of thousands (likely hundreds of thousands eventually) of records is going to hurt the performance.
Is there a better way to implement this kind of logic than what I'm doing currently?
Thanks
Edit: Quick note, not all the replace functions are just removing periods and spaces. There are string deletions, soundex groups, etc.
Upvotes: 2
Views: 554
Reputation: 1315
I would most likely make a sub that created my patterns for me. This way all I would have to do is pass in an array of the titles I want normalized. Example:
sub make_pattern {
my $list_ref = shift;
my %patterns;
for my $title ( @{$list_ref} ) {
my $result = uc $title;
my $pattern = '/' . join( '\s*', (//, $title)) . '\.*/i';
$patterns{$pattern} = $result;
}
return \%patterns;
}
my @titles = qw (PHD MD DR PROF ) #... plus whatever other titles you have
my $conversion_hash = make_pattern(\@titles);
Then you the resulting hash in conjunction with a closure as listed in some of the other answers here. I have not had time to test my code yet, but it should work.
Upvotes: 0
Reputation: 40142
Rather than running each substitution separately, create a closure that can do the work for you in a more efficient way:
sub make_translator {
my %table = @_;
my $regex = join '|' => map {quotemeta} keys %table;
$regex = qr/$regex/;
return sub {s/($regex)/$table{$1}/g}
}
my $translator = make_translator
' PHD.' => ' PHD ',
' P H D ' => ' PHD ',
' PROF.' => ' PROF '; # ... the rest of the pairs
my @list_of_strings = qw/.../;
$translator->() for @list_of_strings;
It is fastest to not pass anything and use $_
aliased to the array value (which the for
loop does for you).
Upvotes: 5
Reputation: 239980
Here's a technique that should work pretty well if all of your search items are fixed strings:
my %title_replacements = (
' PHD.' => ' PHD ',
' P H D ' => ' PHD ',
# ...,
);
my $titles_to_replace = join '|',
map quotemeta,
keys %title_replacements;
$titles_to_replace = qr/$titles_to_replace/;
sub substitute_titles {
my ($in) = @_;
$$in =~ s/($titles_to_replace)/$title_replacements{$1}/g;
}
If you're running on a perl older than 5.10.0 or 5.8.9, you should consider using Regexp::Trie or Regexp::Assemble to build the regex, but on current perls the regex compiler will automatically trie-optimize any large list of alternations like that, so I left out the unnecessary complication.
Upvotes: 5