Reputation: 969
I want to count the number of common lines that exist between 2 files using Perl.
I have 1 base file, and I want to check whether all of its lines (separated by a newline \n) exist in fileA. What I have done is put all the lines from the base file into a %base_config hash and the lines from fileA into a %config hash. I want to check that every key in %config can also be found among the keys of %base_config. To make the key comparison more efficient, I have sorted the keys of %base_config and put them into @sorted_base_config.
However, for some files that have exactly the same lines but in a different order, I am not able to get the correct count. For example, the base file contains:
hello
hi
tired
sleepy
whereas fileA contains:
hi
tired
sleepy
hello
I am able to read the values from the files and place them into their respective hashes and arrays. Here is the part of the code that went wrong:
$count=0;
while(($key,$value)=each(%config))
{
foreach (@sorted_base_config)
{
print "config: $config{$key}\n";
print "\$_: $_\n";
if($config{$key} eq $_)
{
$count++;
}
}
}
Can someone please tell me if I have made a mistake? The count is supposed to be 4, but it keeps printing 2.
EDIT: Here's my original code that didn't work. It looks quite different because I tried to use different methods to fix the problem. However, I am still stuck at the same problem.
#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
(my $word1,my $word2) = split /\n/, $line;
$base_config{$word1} = $word1;
}
#sort BASE_CONFIG_FILE
@sorted_base_config = sort keys %base_config;
#open config file and load them into the config hash
open CONFIG_FILE, "< script/hello.txt" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
(my $word1,my $word2) = split /\n/, $line;
$config{$word1} = $word1;
}
#sort CONFIG_FILE
@sorted_config = sort keys %config;
%common={};
$count=0;
while(($key,$value)=each(%config))
{
$num=keys(%base_config);
$num--;#to get the correct index
#print "$num\n";
while($num>=0)
{
#check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
$common{$value}=$value if exists $base_config{$key};
#print "yes!\n" if exists $base_config{$key};
$num--;
}
}
print "count: $count\n";
while(($key,$value)=each(%common))
{
print "key: ".$key."\n";
print "value: ".$value."\n";
}
$num=keys(%common)-1;
print "common lines: ".$num;
Previously, I pushed the common keys that exist in both the base_config file and fileA into %common. In future I want to print the common keys to a txt file, and output whatever is found in fileA but not in the base_config file to another txt file. However, I am already stuck at the initial phase of finding the common keys.
I am using "\n" to split the lines into keys for storing, so I can't use the chomp function, which would remove the "\n".
EDIT 2: I just realised what's wrong with my code. At the end of my txt files, I need to add "\n" to make it work. Thanks for all your help! :D
Upvotes: 0
Views: 1041
Reputation: 755094
I think your attempt at efficiency is actually slowing things down.
my %listA;
# Read first file (name in $NameA)
{
open my $fileA, '<', "$NameA" or die $!;
while (<$fileA>)
{
chomp;
$listA{$_}++;
}
}
# Read second file (name in $NameB)
{
open my $fileB, '<', "$NameB" or die $!;
while (<$fileB>)
{
chomp;
if ($listA{$_})
{
print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
}
}
}
If you want to read the second file into a hash too, then that also works:
my %listB;
# Read second file (name in $NameB)
{
open my $fileB, '<', "$NameB" or die $!;
while (<$fileB>)
{
chomp;
$listB{$_}++;
}
}
foreach my $key (sort keys %listA)
{
if ($listB{$key})
{
print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
}
}
Now, if a particular line appears in both files, it will be listed. Note that even though I present the keys in sorted order, I'm using the hash lookup because that will be quicker than shuffling through two sorted arrays. You'd be hard-pressed to measure any difference on 4-line files, of course. And with large files, the chances are that the I/O time reading the files and printing the results will dominate the lookup time.
Reorganize the output as you wish.
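Since the question asks for a count rather than a listing, the same hash lookup can be reduced to a counter. This is only a minimal sketch: the helper name count_common_lines is hypothetical, it works on arrays of already-chomped lines rather than on files, and it assumes each distinct line should be counted once.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: count how many distinct lines of @$lines_b
# also occur in @$lines_a. The order of the lines does not matter.
sub count_common_lines {
    my ($lines_a, $lines_b) = @_;
    my %in_a = map { $_ => 1 } @$lines_a;   # lookup table for the first list
    my %seen;
    my $count = 0;
    for my $line (@$lines_b) {
        next if $seen{$line}++;             # count each distinct line once
        $count++ if $in_a{$line};
    }
    return $count;
}

my @base  = qw(hello hi tired sleepy);
my @fileA = qw(hi tired sleepy hello);
print "common lines: ", count_common_lines(\@base, \@fileA), "\n";
```

With the sample data from the question this prints a count of 4, whatever order the lines are in.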
Untested code! Code now tested - see below.
fileA:
hello
hi
tired
sleepy
fileB:
hi
tired
sleepy
hello
This is the first fragment turned into a complete working program (ppp.pl).
#!/usr/bin/env perl
use strict;
use warnings;
my $NameA = "fileA";
my $NameB = "fileB";
my %listA;
# Read first file (name in $NameA)
{
open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
while (<$fileA>)
{
chomp;
$listA{$_}++;
}
}
# Read second file (name in $NameB)
{
open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
while (<$fileB>)
{
chomp;
if ($listA{$_})
{
print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
}
}
}
$ perl ppp.pl
Line appears in fileB once and 1 times in fileA: hi
Line appears in fileB once and 1 times in fileA: tired
Line appears in fileB once and 1 times in fileA: sleepy
Line appears in fileB once and 1 times in fileA: hello
$
Note that this is listing things in the order of fileB, as it should given that the loop reads through fileB and checks each line in turn.
This is the second fragment turned into a complete working program (qqq.pl).
#!/usr/bin/env perl
use strict;
use warnings;
my $NameA = "fileA";
my $NameB = "fileB";
my %listA;
# Read first file (name in $NameA)
{
open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
while (<$fileA>)
{
chomp;
$listA{$_}++;
}
}
my %listB;
# Read second file (name in $NameB)
{
open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
while (<$fileB>)
{
chomp;
$listB{$_}++;
}
}
foreach my $key (sort keys %listA)
{
if ($listB{$key})
{
print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
}
}
$ perl qqq.pl
fileA: 1; fileB: 1; hello
fileA: 1; fileB: 1; hi
fileA: 1; fileB: 1; sleepy
fileA: 1; fileB: 1; tired
$
Note that the keys are listed in sorted order, which is not the order in either fileA or fileB.
Minor miracles occasionally happen! Apart from adding the 5 lines of preamble (shebang, 2 x use, 2 x my), the code for both program fragments worked correctly, according to my reckoning, first time for both programs. (Oh, and I improved the error messages on failing to open the file, at least identifying which file I failed to open. And ikegami edited my code (thanks!) to add the chomp calls consistently, and the newlines to the print operations which now need the explicit newline.)
I would not claim this is great Perl code; it certainly won't win a (code) golfing contest. It does seem to work, though.
open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
(my $word1,my $word2) = split /\n/, $line;
$base_config{$word1} = $word1;
}
The split is odd...you have a line that ends with a newline, and you split at the newline, so $word2 is empty, and $word1 contains the rest of the line. You then store the value $word1 (not $word2 as I assumed at first glance) into the base configuration. So the key and the value are the same for each entry. Unusual. Not actually wrong, but ... unusual. The second loop is essentially the same (we should both be shot for not using a single sub to do the reading for us).
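The "single sub to do the reading" could look roughly like the sketch below. The name read_line_counts is hypothetical, and the demonstration reads from an in-memory filehandle rather than a real file so that it is self-contained. Because it uses chomp instead of splitting on "\n", it does not matter whether the last line ends with a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from an open filehandle into a
# hash of line => occurrence count, with trailing newlines removed.
sub read_line_counts {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        $count{$line}++;
    }
    return \%count;
}

# Demonstration using an in-memory filehandle instead of a real file;
# the last line deliberately has no trailing newline.
my $text = "hello\nhi\ntired\nsleepy";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!\n";
my $lines = read_line_counts($fh);
close $fh;
print join(",", sort keys %$lines), "\n";
```

Calling the same sub once per file keeps the two reading loops from drifting apart.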
You can't be using use strict; and use warnings; - note that practically the first thing I did with my code was add them. I've only been programming in Perl for about 20 years, and I know I don't know enough to risk running code without them. Your sorted arrays, %common, $count, $num, $key, $value are not my'd. It probably doesn't do much harm this time, but...it is a bad sign. Always, but always, use use strict; use warnings; until you know enough about Perl not to need to ask questions about it (and don't expect that to be any time soon).
When I run it, at the point where there is:
my %common={}; # line 32 - I added diagnostic printing
my $count=0;
Perl tells me:
Reference found where even-sized list expected at rrr.pl line 32, <CONFIG_FILE> line 4.
Oops - those {} should be an empty list (). See why you run with warnings enabled!
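To make the difference concrete, here is a small sketch (the variable names are illustrative only). Assigning {} to a hash stores one key, the stringified hash reference, which is where the stray HASH(0x...) entry comes from. The warning is silenced here on purpose so that only the effect is shown.

```perl
use strict;
use warnings;

my %empty = ();                  # empty list: a hash with no keys
print "empty has ", scalar(keys %empty), " key(s)\n";

my %oops;
{
    # Silence "Reference found where even-sized list expected" for the demo.
    no warnings 'misc';
    %oops = ({});                # same effect as %oops = {}
}
my ($key) = keys %oops;
print "oops has ", scalar(keys %oops), " key(s)\n";
print "the stray key is a stringified hash reference\n"
    if $key =~ /^HASH\(0x[0-9a-f]+\)$/;
```

This is also why a count taken from keys(%common) comes out one too high when the hash was initialised with {}.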
And then, at
50 while(my($key,$value)=each(%common))
51 {
52 print "key: ".$key."\n";
53 print "value: ".$value."\n";
54 }
Perl tells me:
key: HASH(0x100827720)
Use of uninitialized value $value in concatenation (.) or string at rrr.pl line 53, <CONFIG_FILE> line 4.
That's the first entry in %common throwing things for a loop.
rrr.pl
#!/usr/bin/env perl
use strict;
use warnings;
#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< fileA" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
(my $word1,my $word2) = split /\n/, $line;
$base_config{$word1} = $word1;
print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}
{ print "First file:\n"; foreach my $key (sort keys %base_config) { print "$key => $base_config{$key}\n"; } }
#sort BASE_CONFIG_FILE
my @sorted_base_config = sort keys %base_config;
#open config file and load them into the config hash
open CONFIG_FILE, "< fileB" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
(my $word1,my $word2) = split /\n/, $line;
$config{$word1} = $word1;
print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}
#sort CONFIG_FILE
my @sorted_config = sort keys %config;
{ print "Second file:\n"; foreach my $key (sort keys %config) { print "$key => $config{$key}\n"; } }
my %common=();
my $count=0;
while(my($key,$value)=each(%config))
{
print "Loop: $key = $value\n";
my $num=keys(%base_config);
$num--;#to get the correct index
#print "$num\n";
while($num>=0)
{
#check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
$common{$value}=$value if exists $base_config{$key};
#print "yes!\n" if exists $base_config{$key};
$num--;
}
}
print "count: $count\n";
while(my($key,$value)=each(%common))
{
print "key: $key -- value: $value\n";
}
my $num=keys(%common);
print "common lines: $num\n";
$ perl rrr.pl
w1 = <<hello>>; w2 = <<>>
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
First file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
w1 = <<hello>>; w2 = <<>>
Second file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
Loop: hi = hi
Loop: hello = hello
Loop: tired = tired
Loop: sleepy = sleepy
count: 0
key: hi -- value: hi
key: tired -- value: tired
key: hello -- value: hello
key: sleepy -- value: sleepy
common lines: 4
$
Upvotes: 3
Reputation: 1
Without seeing how you defined and populated the %config and @sorted_base_config variables I'm not sure what is causing your code to fail. If you provide the output of running the code you have above it would be more obvious.
Rather than providing a whole new approach as in the other answers, I tried "fixing" yours, but mine works with no issues. That would imply that the error is actually in how you populated the variables, rather than in how you are checking.
For simplicity in matching your code, I assigned both the key and the value to be what was read from the file.
This code:
#!C:\Perl\bin\perl
use strict;
use warnings;
my $f1 = $ARGV[0];
my $f2 = $ARGV[1];
my %config_base;
my %config;
my $line;
print "F1 = $f1\nF2 = $f2\n";
open F1, '<', $f1 or die "Cannot open $f1: $!\n";
while ($line = <F1>) {
chomp $line;
print "adding $line\n";
$config_base{$line}=$line;
}
close F1;
open F2, '<', $f2 or die "Cannot open $f2: $!\n";
while ($line = <F2>) {
chomp $line;
print "adding $line\n";
$config{$line}=$line;
}
close F2;
my $count=0;
my $key; my $value;
my @sorted_base_config = sort keys %config_base;
while(($key,$value)=each(%config))
{
foreach (@sorted_base_config)
{
print "config: $config{$key}\n";
print "\$_: $_\n";
if($config{$key} eq $_)
{
$count++;
}
}
}
print "Count = $count\n";
Results in the output:
F1 = config_base.txt
F2 = config.txt
adding hello
adding hi
adding tired
adding sleepy
adding hi
adding tired
adding sleepy
adding hello
config: hi
$_: hello
config: hi
$_: hi
config: hi
$_: sleepy
config: hi
$_: tired
config: hello
$_: hello
config: hello
$_: hi
config: hello
$_: sleepy
config: hello
$_: tired
config: tired
$_: hello
config: tired
$_: hi
config: tired
$_: sleepy
config: tired
$_: tired
config: sleepy
$_: hello
config: sleepy
$_: hi
config: sleepy
$_: sleepy
config: sleepy
$_: tired
Count = 4
However, Johnathan's answer is a better approach than what you started with. At the very least, using exists to compare the keys of the 2 input hashes is far better than a nested loop against an array of keys. The loop defeats the efficiency of using a hash to begin with.
In that case, you would have something like:
foreach my $key (keys %config_base)
{
print "base key: $key\n";
if(exists $config{$key})
{
# Only print the config value once we know the key exists there
print "config: $config{$key}\n";
$count++;
}
}
print "Count = $count\n";
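The question also mentions wanting the common lines in one place and the lines found only in fileA in another. A rough sketch of that split, using the same exists test, might look like this. The helper name partition_lines and the sample data are hypothetical, and writing each list to its own file is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: split the lines of @$config into those that also
# appear in @$base (common) and those that do not (only in config).
sub partition_lines {
    my ($base, $config) = @_;
    my %in_base = map { $_ => 1 } @$base;
    my (@common, @only_config);
    for my $line (@$config) {
        if (exists $in_base{$line}) {
            push @common, $line;
        }
        else {
            push @only_config, $line;
        }
    }
    return (\@common, \@only_config);
}

my @base   = qw(hello hi tired sleepy);
my @config = qw(hi tired sleepy hello extra);
my ($common, $only) = partition_lines(\@base, \@config);
print "common (", scalar(@$common), "): @$common\n";
print "only in config (", scalar(@$only), "): @$only\n";
```

Each returned array reference can then be printed to whichever output file is wanted.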
Upvotes: 0
Reputation: 7603
Maybe it's not the approach you are looking for, but what if you went about it more like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my @sorted_base_config = qw(hello hi tired sleepy);
my @file_a = qw(hi tired sleepy hello);
my @found_in_both = ();
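foreach my $item (@sorted_base_config) {
# Compare with eq: inside grep, $_ is each element of @file_a, so /$_/ would match every element against itself
if (grep { $_ eq $item } @file_a) {
push(@found_in_both, $item);
}
}
print "These items were found in file_a:\n";
print Dumper(\@found_in_both);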
Basically, instead of doing the key/value hash thing... why not try using two arrays and using foreach for the base file array? As you go through each line of @sorted_base_config, you check to see if the string can be found in @file_a.
It's up to you as to how you want to get the files into the @sorted_base_config and @file_a arrays (and how to deal with newlines or line breaks). But this way, at least, it seems to give a more accurate check of which words match.
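One thing to watch with the array approach is that a pattern match also accepts substrings ("hi" matches inside "this"), so an exact comparison with eq is probably what is wanted for whole lines. A small sketch, with a hypothetical helper name and sample data chosen to show the difference:

```perl
use strict;
use warnings;

# Hypothetical helper: return the items of @$wanted that appear, as
# exact strings, somewhere in @$pool.
sub found_in_both {
    my ($wanted, $pool) = @_;
    my @found;
    for my $item (@$wanted) {
        # grep in boolean context: true if any element equals $item
        push @found, $item if grep { $_ eq $item } @$pool;
    }
    return @found;
}

my @base   = qw(hello hi tired sleepy);
my @file_a = qw(this tired sleepy hello);   # "this" contains "hi" but is not "hi"
my @found  = found_in_both(\@base, \@file_a);
print "found: @found\n";
```

Here "hi" is correctly reported as missing, whereas a regex test against "this" would have counted it.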
Upvotes: 0