Reputation: 11703
I want to delete lines matching any of the patterns listed in an exclusion file from an input file.
The input file is pretty big (~500 MB), so I am also looking for an efficient solution.
Please note that the example below is just a sample; the exclusions may contain complex patterns, including special characters such as /.
File containing list of exclusions (exception)
Jun
Jul
Aug
Input file (infile)
Jan 02, 2013
Jul 02, 1988
Feb 02, 1988
Jun 02, 1988
Feb 02, 1988
Aug 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Jun 02, 1988
Nov 02, 1988
Desired Output (outfile)
Jan 02, 2013
Feb 02, 1988
Feb 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Nov 02, 1988
I can use the following command, given the list of exclusions, and it works fine.
egrep -v "Jun|Jul|Aug" infile > outfile
My problem is how to get a pipe (|) separated string from the exception file and pass it to the above grep command. Or is there any other optimal way to achieve this?
I have to implement this as part of a Perl solution in which further processing is done through a hash, but I am open to any Linux solution, as I can execute those commands from within my Perl script.
Your help in this regard would be highly appreciated.
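One possible way to do this from within Perl itself (just a sketch, assuming egrep is on the PATH and the exception/infile/outfile names used above) is to join the exclusion lines with | and hand the pattern to egrep through a list-form piped open, so the pattern never passes through the shell:
#!/usr/bin/perl
use strict;
use warnings;
# Build the alternation from the exclusion file
open my $exf, '<', 'exception' or die $!;
chomp( my @exclusions = <$exf> );
close $exf;
my $pattern = join '|', @exclusions;
# List-form piped open: the pattern is passed to egrep as a single
# argument (after -e), so no shell quoting is involved
open my $grep, '-|', 'egrep', '-v', '-e', $pattern, 'infile' or die $!;
open my $out, '>', 'outfile' or die $!;
while (<$grep>) {
    print $out $_;
}
close $grep;
close $out;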
UPDATE
While people were helping me out with their solutions, I managed to write the following piece of Perl code, and it also worked.
#!/usr/bin/perl
use warnings;
use strict;
open my $exfread, '<', "exception" or die $!;
chomp ( my @exclusions = <$exfread> );
close $exfread;
my $ex_str = join '|', @exclusions;
# print $ex_str . "\n";
open my $infread, '<', "infile" or die $!;
open my $outfwrite, '>', "outfile" or die $!;
while (<$infread>) {
    next if /${ex_str}/;
    print $outfwrite $_;
    # do some more processing using hash
}
close $outfwrite;
close $infread;
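If the exclusion entries are meant to be matched literally rather than treated as regular expressions (the exclusion file may contain special characters, as noted above), a variation of this code (again just a sketch) escapes each entry with quotemeta and compiles the alternation once with qr//:
#!/usr/bin/perl
use warnings;
use strict;
open my $exfread, '<', "exception" or die $!;
chomp ( my @exclusions = <$exfread> );
close $exfread;
# Escape regex metacharacters so each exclusion matches literally,
# then compile the joined pattern once
my $ex_re = do {
    my $joined = join '|', map { quotemeta } @exclusions;
    qr/$joined/;
};
open my $infread, '<', "infile" or die $!;
open my $outfwrite, '>', "outfile" or die $!;
while (<$infread>) {
    next if /$ex_re/;
    print $outfwrite $_;
}
close $outfwrite;
close $infread;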
I would love to hear feedback on the different approaches with respect to their efficiency. As I mentioned earlier, since my input file is huge and the number of files is also significant, the next point of worry for me would be run time.
Upvotes: 0
Views: 2776
Reputation: 126722
This program should suit your purposes. It works by forming a regular expression from the contents of exception.txt by joining each line with the alternation operator |. The regex is then compiled with qr//.
This should prove extremely fast, as only a single regex comparison is performed for each line.
use strict;
use warnings;
use autodie;
my $regex = do {
    open my $in, '<', 'exception.txt';
    my @infile = <$in>;
    chomp @infile;
    local $" = '|';     # interpolate the array with | between the elements
    qr/@infile/;        # compile the alternation once
};

open my $in, '<', 'infile.txt';
open my $out, '>', 'outfile.txt';

while (<$in>) {
    print $out $_ unless $_ =~ $regex;
}
output
Jan 02, 2013
Feb 02, 1988
Feb 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Nov 02, 1988
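A possible refinement, not part of the original answer: since the month is always the first field in the sample data, the alternation can be anchored to the start of each line with ^(?:...), so an exclusion string appearing later in a line would not cause it to be dropped. A sketch with only the regex construction changed:
use strict;
use warnings;
use autodie;
my $regex = do {
    open my $in, '<', 'exception.txt';
    my @infile = <$in>;
    chomp @infile;
    local $" = '|';
    # (?:...) groups the alternation so the ^ anchor applies to every entry
    qr/^(?:@infile)/;
};
open my $in, '<', 'infile.txt';
open my $out, '>', 'outfile.txt';
while (<$in>) {
    print $out $_ unless $_ =~ $regex;
}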
Upvotes: 4
Reputation: 33360
Instead of going outside of Perl, why not just read and filter inside, like so?
#!/usr/bin/env perl
use strict;
use warnings;
my $ifile = 'old.txt';
my $ofile = 'new.txt';
open (my $ifh, '<', $ifile) or die $!;
open (my $ofh, '>', $ofile) or die $!;
while (<$ifh>) {
    # (?:...) groups the alternation so the ^ anchor applies to Jul and Aug as well, not only Jun
    print $ofh $_ unless /^(?:Jun|Jul|Aug)/;
}
close ($ifh);
close ($ofh);
Upvotes: 0