DKru

Reputation: 19

Perl: duplicate keys not overwriting in hash

I have a problem that I can't seem to find an answer for.

I have a CSV file that contains performance records for different individuals. There is only supposed to be one record per individual; however, some individuals have several records with different information. I would like to compare this first file to another file that also has a list of individuals, though I only want to check whether each individual in file 1 also has a record in file 2 (file 2 does not have duplicates). The individuals' IDs are unique.

Example for file 1:

ID number      A       B     C            D
4011NM16001    apple   24    sunday       2016-01-01
4011NM16001    apple   16    wednesday    2016-01-01
4012NM15687    pear    16    sunday       2015-04-19
4012NM15002    banana  8     monday       2015-09-09
4012NM14301    peach   10    wednesday    2014-03-18
4012NM14301    peach   18    wednesday    2014-03-18

I have opened the first file and tried to put the data into a hash (or rather a combination of a hash and an array, if I understand the concepts correctly) so as to remove the duplicates, using the ID as the unique key. However, instead of overwriting entries with the same ID, it still seems to add them, so I still end up with the duplicate records.

I want to see this:

ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301

But instead I still see this:

ID number      
4011NM16001    
4011NM16001    
4012NM15687    
4012NM15002    
4012NM14301    
4012NM14301    

Have I typed something wrong in my code, or am I not using the hash correctly? I'm still new to Perl, so I use parts of previous programs and try to learn as I go.

#!/usr/bin/env perl

use DBI;

use strict;
use warnings;

my $file1  = 'location1.csv';   #file1 containing records with duplicates
my $exists = 'location3.csv';  #output file with unique IDs that will be compared to file2

open (EXISTS, ">$exists") or die "Cannot open $exists";
    print EXISTS "ID number\n";

open (FILE1, "$file1") or die "Cannot open $file1";

while (<FILE1>){

    my %file1;

    my $line = $_;
    $line =~ s/\s*$//g;

    my ($ID, $a, $b, $c, $d) = split('\,', $line);
    next if !$ID or substr($ID,0,2) eq 'ID';

    $file1{$ID}[0]=$ID;  #unique ID number
    $file1{$ID}[1]=$a;   #record a
    $file1{$ID}[2]=$b;   #record b
    $file1{$ID}[3]=$c;   #record c
    $file1{$ID}[4]=$d;   #record d

    print EXISTS "$file1{$ID}[0]\n";
}

exit;

Upvotes: 0

Views: 575

Answers (2)

Borodin
Borodin

Reputation: 126722

In addition to choroba's diagnosis, you need to declare the hash outside the while loop; otherwise each iteration of the loop is dealing with a new, empty hash.
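A minimal, self-contained sketch of that scoping point, using a plain array of IDs instead of the CSV file (illustrative only, not the full fix):

use strict;
use warnings;

my @ids = qw(4011NM16001 4011NM16001 4012NM15687);

# The hash is declared outside the loop, so it persists across iterations
# and the duplicate is detected; only two IDs are printed.
my %seen;
for my $id (@ids) {
    print "$id\n" unless $seen{$id}++;
}

# If "my %seen;" were moved inside the loop, every iteration would start
# with an empty hash and all three IDs would be printed.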

Here's a version of your code that uses best-practice Perl and produces the result that you wanted. Note that I've had to alter the format of your input file location1.csv, as the values you show don't contain any commas.

#!/usr/bin/env perl

use strict;
use warnings;

my $file1  = 'location1.csv';    # file1 containing records with duplicates
my $exists = 'location3.csv';    # output file with unique IDs that will be compared to file2

open my $exists_fh, '>', $exists or die qq{Unable to open "$exists" for output: $!};
print $exists_fh "ID number\n";

open my $file1_fh, '<', $file1 or die qq{Unable to open "$file1" for input: $!};
<$file1_fh>; # skip header line

my %file1;

while ( <$file1_fh> ) {

    next unless /\S/; # Skip blank lines

    s/\s+\z//;

    my @fields = split /,/;
    my $id = $fields[0];

    next if $file1{$id}; # Skip this record if the ID is already known

    $file1{$id} = \@fields;

    print $exists_fh "$id\n";
}

output

ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301
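
For reference, the comma-separated form of the sample data assumed by this code would look something like this (an assumption, since the question shows the file with space-aligned columns):

ID number,A,B,C,D
4011NM16001,apple,24,sunday,2016-01-01
4011NM16001,apple,16,wednesday,2016-01-01
4012NM15687,pear,16,sunday,2015-04-19
4012NM15002,banana,8,monday,2015-09-09
4012NM14301,peach,10,wednesday,2014-03-18
4012NM14301,peach,18,wednesday,2014-03-18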

Upvotes: 0

choroba

Reputation: 241828

You are printing the ID for every input line, not only for IDs that haven't been seen yet. Move the print above the block of assignments and change it to

print EXISTS "$ID\n" unless exists $file1{$ID};
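
In context, the loop body would then look roughly like this (a sketch that also declares %file1 outside the loop, as the other answer points out):

my %file1;    # declared outside the loop so it persists across input lines

while (<FILE1>) {

    my $line = $_;
    $line =~ s/\s*$//g;

    my ($ID, $a, $b, $c, $d) = split(',', $line);
    next if !$ID or substr($ID, 0, 2) eq 'ID';

    print EXISTS "$ID\n" unless exists $file1{$ID};   # print each ID only once

    $file1{$ID}[0] = $ID;   # unique ID number
    $file1{$ID}[1] = $a;    # record a
    $file1{$ID}[2] = $b;    # record b
    $file1{$ID}[3] = $c;    # record c
    $file1{$ID}[4] = $d;    # record d
}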

Upvotes: 4
