Reputation: 19
I have a problem that I can't seem to find an answer for.
I have a CSV file that contains performance records for different individuals. There is only supposed to be one record per individual; however, some individuals have several records with different information. I would like to compare this first file to another file that also lists individuals, but all I need to check is whether an individual in file 1 also has a record in file 2 (file 2 does not have duplicates). The individuals' IDs are unique.
Example for file 1:
ID number A B C D
4011NM16001 apple 24 sunday 2016-01-01
4011NM16001 apple 16 wednesday 2016-01-01
4012NM15687 pear 16 sunday 2015-04-19
4012NM15002 banana 8 monday 2015-09-09
4012NM14301 peach 10 wednesday 2014-03-18
4012NM14301 peach 18 wednesday 2014-03-18
I have opened the first file and tried to put the data into a hash (or rather a combination of a hash and an array, if I understand the concepts correctly) so as to remove the duplicates, using the ID as the unique key. However, instead of overwriting entries with the same ID, it still seems to add them, so I still end up with the duplicate records.
I want to see this:
ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301
But instead I still see this:
ID number
4011NM16001
4011NM16001
4012NM15687
4012NM15002
4012NM14301
4012NM14301
Have I typed something wrong in my code, or am I not using the hash correctly? I'm still new to Perl, so I use parts of previous programs and try to learn as I go.
#!/usr/bin/env perl
use DBI;
use strict;
use warnings;
my $file1 = 'location1.csv'; #file1 containing records with duplicates
my $exists = 'location3.csv'; #output file with unique IDs that will be compared to file2
open (EXISTS, ">$exists") or die "Cannot open $exists";
print EXISTS "ID number\n";
open (FILE1, "$file1") or die "Cannot open $file1";
while (<FILE1>) {
    my %file1;
    my $line = $_;
    $line =~ s/\s*$//g;
    my ($ID, $a, $b, $c, $d) = split('\,', $line);
    next if !$ID or substr($ID,0,2) eq 'ID';
    $file1{$ID}[0] = $ID; # unique ID number
    $file1{$ID}[1] = $a;  # record a
    $file1{$ID}[2] = $b;  # record b
    $file1{$ID}[3] = $c;  # record c
    $file1{$ID}[4] = $d;  # record d
    print EXISTS "$file1{$ID}[0]\n";
}
exit;
Upvotes: 0
Views: 575
Reputation: 126722
In addition to choroba's diagnosis, you need to declare the hash outside the while loop; otherwise each iteration of the loop is dealing with a new, empty hash.
Here's a version of your code that uses best-practice Perl and produces the result that you wanted. Note that I've had to alter the format of your input file location1.csv, as the values you show don't contain any commas.
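For reference, this is the comma-separated form of your sample data that the code below expects (taken directly from the rows in your question):
ID number,A,B,C,D
4011NM16001,apple,24,sunday,2016-01-01
4011NM16001,apple,16,wednesday,2016-01-01
4012NM15687,pear,16,sunday,2015-04-19
4012NM15002,banana,8,monday,2015-09-09
4012NM14301,peach,10,wednesday,2014-03-18
4012NM14301,peach,18,wednesday,2014-03-18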
#!/usr/bin/env perl
use strict;
use warnings;
my $file1 = 'location1.csv'; # file1 containing records with duplicates
my $exists = 'location3.csv'; # output file with unique IDs that will be compared to file2
open my $exists_fh, '>', $exists or die qq{Unable to open "$exists" for output: $!};
print $exists_fh "ID number\n";
open my $file1_fh, '<', $file1 or die qq{Unable to open "$file1" for input: $!};
<$file1_fh>; # skip header line
my %file1;
while ( <$file1_fh> ) {
    next unless /\S/;    # Skip blank lines
    s/\s+\z//;
    my @fields = split /,/;
    my $id = $fields[0];
    next if $file1{$id}; # Skip this record if the ID is already known
    $file1{$id} = \@fields;
    print $exists_fh "$id\n";
}
The output is:
ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301
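The question also asks which of these IDs appear in file 2. Here is a minimal sketch of that check, reusing the %file1 hash built above and assuming file 2 is a CSV called location2.csv with a header line and the ID in its first column (the file name and layout are assumptions):
open my $file2_fh, '<', 'location2.csv' or die qq{Unable to open "location2.csv" for input: $!};
<$file2_fh>;    # skip header line

my %file2;
while ( <$file2_fh> ) {
    next unless /\S/;    # skip blank lines
    s/\s+\z//;
    my ($id) = split /,/;
    $file2{$id} = 1;     # remember every ID seen in file 2
}

# %file1 holds one entry per unique ID from file 1
for my $id ( sort keys %file1 ) {
    print "$id is also in file 2\n" if $file2{$id};
}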
Upvotes: 0
Reputation: 241828
You are printing a line for every input line, not only for IDs that haven't been seen yet.
Move the print before the assignment paragraph, and change it to
print EXISTS "$ID\n" unless exists $file1{$ID};
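In context, a minimal sketch of the question's loop with that change applied (and with %file1 declared outside the loop, as the other answer points out); the five separate element assignments are collapsed into one arrayref assignment here:
my %file1;
while (<FILE1>) {
    my $line = $_;
    $line =~ s/\s*$//g;
    my ($ID, $a, $b, $c, $d) = split('\,', $line);
    next if !$ID or substr($ID,0,2) eq 'ID';
    print EXISTS "$ID\n" unless exists $file1{$ID};   # print only the first time an ID is seen
    $file1{$ID} = [$ID, $a, $b, $c, $d];              # store the record keyed by ID
}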
Upvotes: 4