Stephen Dundas
Stephen Dundas

Reputation: 33

Perl - Removing multiple lines from file with multiple regex

I'm (obviously) new to Perl and am trying to create a simple script to clean up a large file on about 4.5 million records on a weekly basis. I want to completely remove the lines that match one of three patterns. The file looks like this:

D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

My script below works, but I get a blank line where a deletion occurs (substitution) rather than removing the line completely.

#!/usr/bin/perl

open( FH, "serial.txt" ) || die "Couldn't open file...\n";

while ( <FH> ) {
   $data .= $_;
}

$data =~ s/[A][F|G][(0-9)]{5}//g;
$data =~ s/[A-Z][0-9][0-9][0-9][0-9]//g;
$data =~ s/[0-9][0-9][0-9][0-9][A-Z]//g;

print $data;
close( FH );

My question is - with 4.5 million records, running this at least once a week, is this an efficient/fast way to accomplish what I want to do, or is there a more efficient way to do it? In addition, how can I remove the lines rather than substituting a blank line?

Thanks all. Stephen

Upvotes: 1

Views: 2161

Answers (4)

ssr1012
ssr1012

Reputation: 2589

 $data =~ s/[A-Z][0-9][0-9][0-9][0-9][\s\r\n]*//g;
 $data =~ s/[0-9][0-9][0-9][0-9][A-Z][\s\r\n]*//g;

From the question:

"how can I remove the lines rather than substituting a blank line?"

End of the each regex which we can have a linebreak/returns. And then regex will replacing the empty line. Hence I have added the [\s\r\n]* syntax and it will not replace the empty line.

Upvotes: 0

brian d foy
brian d foy

Reputation: 132812

At first pass, I'd make a list of pre-compiled patterns to test against each line. The problem is likely to change and I want to add and delete patterns without disturbing the meat of the code:

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if grep { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

The big improvement isn't the patterns though. It's checking things one line at a time and printing the lines I want to keep. I don't have the entire file in memory at the same time; it's only a line at a time.

There's a problem with this though. It works, but it checks every pattern every time. That might not mean much if very few lines will ever match or there are only a few patterns. If you think it might matter, using first from List::Util instead of grep can help since it only needs to find one match and stops when it finds it:

use List::Util qw(first);

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if first { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

Or, I might make one giant pattern. Regexp::Assemble can put them together (but so can you if you watch out for the alternation precedence):

use v5.10;

use Regexp::Assemble;

my @patterns = ( 
    '[A][FG][0-9]{5}',
    '[A-Z][0-9]{4}',
    '[0-9]{4}[A-Z]',
    );

my $grand_pattern = do {
    my $ra = Regexp::Assemble->new;
    $ra->add( $_ ) for @patterns;
    my $re = $ra->re;
    qr/ \A (?: $re ) \Z /x;
    };

say "Grand regex is $grand_pattern";

while( my $line = <DATA> ) {
    next if $line =~ $grand_pattern;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

The next step would be to take the patterns from the command line or a configuration file, but that's not so hard. The program shouldn't know the patterns at all. You'll have a much easier time changing the patterns if you don't have to change the code.

Upvotes: 3

Borodin
Borodin

Reputation: 126722

There's no need for multiple regex patterns. This will do what you need

perl -ne'print unless /^(?:[A][FG]\d{5}|[A-Z]\d{4}|\d{4}[A-Z])$/' serial.txt

output

ZDS97
FHM2547
JDH1464

Upvotes: 0

C. K. Young
C. K. Young

Reputation: 223023

@ndn's comment is correct. However, personally, rather than reading in the whole file, I'd process it line by line (I took the liberty to tidy up your regexes, too):

#!/usr/bin/perl -p
$_ = '' if /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

or

#!/usr/bin/perl -n
print unless /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

(In both cases, specify your input file on the command line. Read up the perlrun manual page on how the -p and -n options work.)

Upvotes: 3

Related Questions