user1090708

Reputation: 67

Reading and comparing lines in Perl

I am having trouble getting my Perl script to work. The issue might be related to reading the extract file line by line within the while loop. Any help would be appreciated. There are two files:

A bad file that contains a list of bad IDs (hundreds of IDs):

2
3

An extract file that contains pipe-delimited data with the ID in field 1 (millions of rows):

1|data|data|data
2|data|data|data
2|data|data|data
2|data|data|data
3|data|data|data
4|data|data|data
5|data|data|data

I am trying to remove every row from the large extract file whose ID appears in the bad file. Several rows can share the same ID. The extract is sorted.

#use strict;
#use warnning;

$SourceFile = $ARGV[0];
$ToRemove = $ARGV[1];
$FieldNum = $ARGV[2];
$NewFile = $ARGV[3];
$LargeRecords = $ARGV[4];

open(INFILE, $SourceFile) or die "Can't open source file: $SourceFile \n";
open(REMOVE, $ToRemove) or die "Can't open toRemove file: $ToRemove \n";
open(OutGood, "> $NewFile") or die "Can't open good output file \n";
open(OutLarge, "> $LargeRecords") or die "Can't open Large Records output file \n";


#Read in the list of bad IDs into array
@array = <REMOVE>;

#Loop through each bad record 
foreach (@array)
{
$badID = $_;

#read the extract line by line 
while(<INFILE>)
{
    #take the line and split it into 
    @fields = split /\|/, $_;
    my $extractID = $fields[$FieldNum];

    #print "Here's what we got: $badID and $extractID\n";

    while($extractID == $badID) 
    {
        #Write out bad large records
        print OutLarge join '|', @fields;

        #Get the next line in the extract file
        @fields = split /\|/, <INFILE>;
        my $extractID = $fields[$FieldNum];

        $found = 1; #true

        #print " We got a match!!";

        #remove item after it has been found 
        my $input_remove = $badID;
        @array = grep {!/$input_remove/} @array;


    }

print OutGood join '|', @fields;

}

}

Upvotes: 0

Views: 823

Answers (3)

Austin Hastings

Reputation: 627

Try this:

$ perl -F'\|' -nae 'BEGIN {while(<>){chomp; $bad{$_}++;last if eof;}} print unless $bad{$F[0]};' bad good
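
A note on the field separator: the value given to -F is used as a regular expression, so a literal pipe has to be escaped (on the command line, -F'\|'). Here is a minimal sketch of the difference, using a made-up row with a two-digit ID that is not part of the sample data:

```perl
use strict;
use warnings;

my $row = "23|data|data|data";   # made-up row with a two-digit ID

# Unescaped: | is regex alternation between two empty patterns, so the
# pattern matches the empty string and split returns single characters.
my @naive = split /|/, $row;

# Escaped: splits on the literal pipe character.
my @escaped = split /\|/, $row;

print "unescaped first field: $naive[0]\n";
print "escaped first field:   $escaped[0]\n";
```

With the unescaped pattern the first field comes out as 2 rather than 23, so a row with ID 23 would be dropped whenever 2 is in the bad list.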

Upvotes: 2

Sinan Ünür

Reputation: 118138

First, you are lucky: the number of bad IDs is small. That means you can read the list of bad IDs once and store them in a hash without running into memory problems. Once they are in a hash, you read the big data file line by line and skip the output for rows with bad IDs.

#!/usr/bin/env perl

use strict;
use warnings;

# hardwired for convenience
my $bad_id_file = 'bad.txt';
my $data_file = 'data.txt';

my $bad_ids = read_bad_ids($bad_id_file);

remove_data_with_bad_ids($data_file, $bad_ids);

sub remove_data_with_bad_ids {
    my $file = shift;
    my $bad = shift;

    open my $in, '<', $file
        or die "Cannot open '$file': $!";
    while (my $line = <$in>) {
        if (my ($id) = extract_id(\$line)) {
            exists $bad->{ $id } or print $line;
        }
    }

    close $in
        or die "Cannot close '$file': $!";
    return;
}

sub read_bad_ids {
    my $file = shift;
    open my $in, '<', $file
        or die "Cannot open '$file': $!";

    my %bad;
    while (my $line = <$in>) {
        if (my ($id) = extract_id(\$line)) {
            $bad{ $id } = undef;
        }
    }
    close $in
        or die "Cannot close '$file': $!";
    return \%bad;
}

sub extract_id {
    my $string_ref = shift;
    if (my ($id) = ($$string_ref =~ m{\A ([0-9]+) }x)) {
        return $id;
    }
    return;
}
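
The script in the question also writes the removed rows to a second file and takes the ID field number as an argument. A minimal sketch of that variant follows. It assumes a zero-based field index; `split_extract` is a hypothetical name, and in-memory filehandles stand in for the real files so the example runs on its own:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): copies each row of
# $in_fh to $removed_fh if its ID is a key of %$bad, else to $good_fh.
# $field is the zero-based index of the ID column.
sub split_extract {
    my ($in_fh, $bad, $field, $good_fh, $removed_fh) = @_;

    while (my $line = <$in_fh>) {
        chomp(my $record = $line);
        my @fields = split /\|/, $record, -1;
        my $id = $fields[$field];

        if (defined $id && exists $bad->{$id}) {
            print {$removed_fh} $line;
        }
        else {
            print {$good_fh} $line;
        }
    }
    return;
}

# Demo with in-memory filehandles standing in for the real files.
my $extract = join '', map("$_|data|data|data\n", 1, 2, 2, 2, 3, 4, 5);
my %bad = map { $_ => 1 } 2, 3;
my ($good, $removed) = ('', '');

open my $in_fh,      '<', \$extract or die "Cannot open input: $!";
open my $good_fh,    '>', \$good    or die "Cannot open good output: $!";
open my $removed_fh, '>', \$removed or die "Cannot open removed output: $!";

split_extract($in_fh, \%bad, 0, $good_fh, $removed_fh);
close $_ for $in_fh, $good_fh, $removed_fh;

print "kept:\n$good", "removed:\n$removed";
```

Because each row is written as soon as it is read, memory use stays proportional to the bad-ID list rather than to the extract.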

Upvotes: 1

fugu

Reputation: 6578

I'd use a hash as follows:

use warnings;
use strict;

my @bad = qw(2 3);

my %bad;

$bad{$_} = 1 foreach @bad;

my @file = qw (1|data|data|data 2|data|data|data 2|data|data|data 2|data|data|data 3|data|data|data 4|data|data|data 5|data|data|data);

# Print each row in its original order unless its ID is in %bad.
# (Keying a second hash on the ID would keep only one row per ID.)
foreach my $row (@file){
    my ($id) = split(/\|/, $row);
    print "$row\n" unless exists $bad{$id};
}

Which gives:    

1|data|data|data
4|data|data|data
5|data|data|data

Upvotes: 1
