jkshah

Reputation: 11703

Delete lines from input files excluding patterns listed in another file

I want to delete, from an input file, any lines that match one of the patterns listed in an exclusion file.

The input file is pretty huge (~500 MB), so I am also looking for an efficient solution.

Please note that the example below is just a sample; the exclusions may contain complex patterns, including special characters such as /.

File containing list of exclusions (exception)

Jun
Jul
Aug

Input file (infile)

Jan 02, 2013
Jul 02, 1988
Feb 02, 1988
Jun 02, 1988
Feb 02, 1988
Aug 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Jun 02, 1988
Nov 02, 1988

Desired Output (outfile)

Jan 02, 2013
Feb 02, 1988
Feb 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Nov 02, 1988

I can use the following command, given the list of exclusions, and it works fine.

egrep -v "Jun|Jul|Aug" infile > outfile

My problem is how to build a pipe-separated (|) string from the exception file and pass it to the grep command above. Or is there another, more efficient way to achieve this?
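One way to build the pipe-separated string directly in the shell is `paste -s`, which joins all lines of a file with a chosen delimiter. A minimal sketch, recreating the sample files inline for demonstration, and assuming no exclusion pattern itself contains a `|`:

```shell
# Recreate the sample exclusion and input files from the question
printf 'Jun\nJul\nAug\n' > exception
printf 'Jan 02, 2013\nJul 02, 1988\nFeb 02, 1988\nJun 02, 1988\n' > infile

# paste -s joins all lines of a file; -d'|' sets the join delimiter,
# producing the string "Jun|Jul|Aug"
pattern=$(paste -sd'|' exception)

# grep -E is the modern spelling of egrep
grep -E -v "$pattern" infile > outfile
```

That said, `grep -f` (shown in the answers below) avoids building the string at all.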

I have to implement this as part of a Perl solution in which further processing is done through a hash. But I am open to any Linux solution, as I can execute such commands from within my Perl script.

Your help in this regard would be highly appreciated.

UPDATE

While people were helping me out with their solutions, I wrote the following piece of Perl code, and it also works.

#!/usr/bin/perl
use warnings;
use strict;

open my $exfread, '<', 'exception' or die $!;
chomp( my @exclusions = <$exfread> );
close $exfread;

# Join the patterns with the alternation operator and compile the
# regex once, rather than interpolating a string on every match.
# If the exclusions are literal strings rather than regexes, use
# join '|', map { quotemeta } @exclusions; instead.
my $ex_str = join '|', @exclusions;
my $ex_re  = qr/$ex_str/;

open my $infread,   '<', 'infile'  or die $!;
open my $outfwrite, '>', 'outfile' or die $!;

while (<$infread>) {
    next if /$ex_re/;
    print $outfwrite $_;
    # do some more processing using hash
}

close $outfwrite;
close $infread;

I would love to hear feedback on the different approaches with respect to their efficiency. As I mentioned earlier, my input file is huge and the number of files is also significant, so my next concern is run time.

Upvotes: 0

Views: 2776

Answers (4)

Michael

Reputation: 242

grep -vf patternfile 

should do the same thing as a single Unix command.
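Since the question notes that the exclusions may contain special characters, it may be worth adding `-F`, which makes grep treat each pattern as a fixed string rather than a regular expression. A sketch with made-up sample data (the `1.88$` entry is a hypothetical pattern containing regex metacharacters):

```shell
# Hypothetical exclusion file: the second entry contains the regex
# metacharacters . and $, which -F makes grep match literally
printf 'Jun\n1.88$\n' > patternfile
printf 'Jan 02, 2013\nJun 02, 1988\nrate 1.88$ each\nFeb 02, 1988\n' > infile

# -v inverts the match, -F uses fixed strings, -f reads patterns from a file
grep -vFf patternfile infile
```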

Upvotes: 1

Kent

Reputation: 195029

For your example, this line works:

grep -vf exception infile
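Recreating the exclusion and input files exactly as given in the question lets this be checked end to end:

```shell
# Exclusion file and input file from the question
printf 'Jun\nJul\nAug\n' > exception
printf '%s\n' 'Jan 02, 2013' 'Jul 02, 1988' 'Feb 02, 1988' \
  'Jun 02, 1988' 'Feb 02, 1988' 'Aug 02, 1988' 'Jan 02, 2013' \
  'Sep 02, 1988' 'Mar 02, 1988' 'Jun 02, 1988' 'Nov 02, 1988' > infile

# -f reads one pattern per line from the exception file
grep -vf exception infile > outfile
```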

Upvotes: 2

Borodin

Reputation: 126722

This program should suit your purposes. It works by forming a regular expression from the contents of exception.txt, joining its lines with the alternation operator |. The regex is then compiled with qr//.

This should prove extremely fast, as only a single regex match is performed for each line.

use strict;
use warnings;
use autodie;

my $regex = do {
  open my $in,  '<', 'exception.txt';
  my @infile = <$in>;
  chomp @infile;
  local $" = '|';
  qr/@infile/;
};

open my $in,  '<', 'infile.txt';
open my $out, '>', 'outfile.txt';

while (<$in>) {
  print $out $_ unless $_ =~ $regex;
}

output

Jan 02, 2013
Feb 02, 1988
Feb 02, 1988
Jan 02, 2013
Sep 02, 1988
Mar 02, 1988
Nov 02, 1988

Upvotes: 4

squiguy

Reputation: 33360

Instead of going outside of Perl, why not just read and filter inside, like this?

#!/usr/bin/env perl

use strict;
use warnings;

my $ifile = 'old.txt';
my $ofile = 'new.txt';

open (my $ifh, '<', $ifile) or die $!;
open (my $ofh, '>', $ofile) or die $!;

while (<$ifh>) {
    # note the grouping: /^Jun|Jul|Aug/ would anchor only the first alternative
    print $ofh $_ unless /^(?:Jun|Jul|Aug)/;
}

close ($ifh);
close ($ofh);

Upvotes: 0
