Reputation: 33
I'm (obviously) new to Perl and am trying to create a simple script to clean up a large file on about 4.5 million records on a weekly basis. I want to completely remove the lines that match one of three patterns. The file looks like this:
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997
My script below works, but I get a blank line where a deletion occurs (substitution) rather than removing the line completely.
#!/usr/bin/perl
open( FH, "serial.txt" ) || die "Couldn't open file...\n";
while ( <FH> ) {
$data .= $_;
}
$data =~ s/[A][F|G][(0-9)]{5}//g;
$data =~ s/[A-Z][0-9][0-9][0-9][0-9]//g;
$data =~ s/[0-9][0-9][0-9][0-9][A-Z]//g;
print $data;
close( FH );
My question is - with 4.5 million records, running this at least once a week, is this an efficient/fast way to accomplish what I want to do, or is there a more efficient way to do it? In addition, how can I remove the lines rather than substituting a blank line?
Thanks all. Stephen
Upvotes: 1
Views: 2161
Reputation: 2589
$data =~ s/[A-Z][0-9][0-9][0-9][0-9][\s\r\n]*//g;
$data =~ s/[0-9][0-9][0-9][0-9][A-Z][\s\r\n]*//g;
From the question:
"how can I remove the lines rather than substituting a blank line?"
End of the each regex which we can have a linebreak/returns. And then regex will replacing the empty line. Hence I have added the [\s\r\n]* syntax and it will not replace the empty line.
Upvotes: 0
Reputation: 132812
At first pass, I'd make a list of pre-compiled patterns to test against each line. The problem is likely to change and I want to add and delete patterns without disturbing the meat of the code:
my @patterns = (
qr/\A [A] [FG] [0-9]{5} \Z/x,
qr/\A [A-Z] [0-9]{4} \Z/x,
qr/\A [0-9]{4} [A-Z] \Z/x,
);
while( my $line = <DATA> ) {
next if grep { $line =~ $_ } @patterns;
print $line;
}
__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997
The big improvement isn't the patterns though. It's checking things one line at a time and printing the lines I want to keep. I don't have the entire file in memory at the same time; it's only a line at a time.
There's a problem with this though. It works, but it checks every pattern every time. That might not mean much if very few lines will ever match or there are only a few patterns. If you think it might matter, using first
from List::Util instead of grep
can help since it only needs to find one match and stops when it finds it:
use List::Util qw(first);
my @patterns = (
qr/\A [A] [FG] [0-9]{5} \Z/x,
qr/\A [A-Z] [0-9]{4} \Z/x,
qr/\A [0-9]{4} [A-Z] \Z/x,
);
while( my $line = <DATA> ) {
next if first { $line =~ $_ } @patterns;
print $line;
}
__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997
Or, I might make one giant pattern. Regexp::Assemble can put them together (but so can you if you watch out for the alternation precedence):
use v5.10;
use Regexp::Assemble;
my @patterns = (
'[A][FG][0-9]{5}',
'[A-Z][0-9]{4}',
'[0-9]{4}[A-Z]',
);
my $grand_pattern = do {
my $ra = Regexp::Assemble->new;
$ra->add( $_ ) for @patterns;
my $re = $ra->re;
qr/ \A (?: $re ) \Z /x;
};
say "Grand regex is $grand_pattern";
while( my $line = <DATA> ) {
next if $line =~ $grand_pattern;
print $line;
}
__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997
The next step would be to take the patterns from the command line or a configuration file, but that's not so hard. The program shouldn't know the patterns at all. You'll have a much easier time changing the patterns if you don't have to change the code.
Upvotes: 3
Reputation: 126722
There's no need for multiple regex patterns. This will do what you need
perl -ne'print unless /^(?:[A][FG]\d{5}|[A-Z]\d{4}|\d{4}[A-Z])$/' serial.txt
ZDS97
FHM2547
JDH1464
Upvotes: 0
Reputation: 223023
@ndn's comment is correct. However, personally, rather than reading in the whole file, I'd process it line by line (I took the liberty to tidy up your regexes, too):
#!/usr/bin/perl -p
$_ = '' if /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;
or
#!/usr/bin/perl -n
print unless /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;
(In both cases, specify your input file on the command line. Read up the perlrun manual page on how the -p
and -n
options work.)
Upvotes: 3