Reputation: 618
How to compare All elements of a file with All elements of another file using C or Perl for much larger data? File 1 contains 100,000 such numbers and file 2 contains 500,000 elements.
I used foreach within a foreach with spliting each and every element in arrays. It worked perfectly in perl but the time consumed to check and print every occurrence of elements of just a single column from File2 in file1 was 40 minutes. There are 28 such columns.
Is there any way to reduce the time or using another language like C?
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
If element in file 2, matched print 'column number' if not print '0'.
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code I am using. Note: it checks File2 columnwise in File 1 and prints the Column Number if true and '0' if false. It will print output for every column in 28 different files.
#!/usr/bin/perl-w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;
for $z (1 .. 28) {
open(OUT, ">$z.txt");
foreach (@hsporg) {
$i = 0;
@h_org = split('\t', $_);
chomp ($h_org[0]);
foreach(@k_org) {
@orginfo = split('\t', $_);
chomp($orginfo[0]);
if($h_org[0] eq $orginfo[0]) {
print OUT "$z\n";
$i = 1;
goto LABEL;
} elsif ($h_org[0] ne $orginfo[0]) {
if($h_org[0]=~/(\w+\s\w+)\s/) {
if($orginfo[0] eq $1) {
print OUT "0";
$i = 1;
goto LABEL;
}
}
}
}
if ($i == 0) {
print OUT "0";
}
LABEL:
}
}
close FH;
close FH1;
close OUT;
Upvotes: 2
Views: 426
Reputation: 57600
This script runs a test case. Note that your expected output is objectively wrong: In file 2, line 1, column 20, the value 0.2
exists.
#!perl
use 5.010; # just for `say`
use strict; use warnings;
use Test::More;
# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_
my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_
my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_
# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";
my %file1 = map { chomp; 0+$_ => undef } <$file1>;
while (<$file2>) {
chomp;
my @vals = split;
# If value exists in file1, print the col number.
my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
chomp(my $expected_line = <$expected>);
is $line, $expected_line;
}
done_testing;
To print the exact same output to 28 files, you would remove the testing code, and rather
say {$_} $line for @filehandles;
instead.
Your existing code is simply quite inefficient and unidiomatic. Let me help you fix that.
First, start all your Perl scripts with use strict; use warnings;
, and if you have a modern perl (v10 or later), you could do use 5.010;
(or whatever your version is) to activate additional features.
The chomp
call takes a variable and removes the current value of $/
(usually a newline) from the end of the string. This is important because the readline operator doesn't do that for us. It is not good for declaring a constant variable. Rather, do
my $file = "File1.txt";
my $hspfle = "File2.txt";
The use strict
forces you to declare your variables properly, you can do so with my
.
To open a file, you should use the following idiom:
open my $fh, "<", $filename or die "Can't open $filename: $!";
Instead of or die ...
you can use autodie
at the top of your script.
This will abort the script if you can't open the file, tell you the reason ($!
), and specifies an explicit open mode (here: <
= read). This avoids bugs with special characters in filenames.
Lexical filehandles (in my
variables, as contrasted to bareword filehandles) have proper scope, and close themselves. There are various other reasons why you should use them.
The split
function takes a regex, not a string as first argument. If you inspect your program closely, you will see that you split
each element in @hsporg
28 times, and each element in @k_org
28 × @hsporg times. This is extremely slow, and unneccessary, as we can do that beforehand.
If a condition is false, you don't need to explicitely test for the falseness again in
if ($h_org[0] eq $orginfo[0]) {
...;
} elsif ($h_org[0] ne $orginfo[0]) {
...;
}
as $a ne $b
is exactly equivalent to not $a eq $b
.
It is quite unidiomatic in Perl to use a goto
, and jumping to a label somewhere isn't especially fast either. Labels are mainly used for loop control:
# random example
LOOP: for my $i (1 .. 10) {
for my $j (1 .. 5) {
next if $i == $j; # start next iteration of current loop
next LOOP if 2 * $i == $j; # start next iteration of labeled loop
last LOOP if $i + $j == 13;# like `break` in C
}
The redo
loop control verb is similar to next
, but doesn't recheck the loop condition, if there is one.
Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or elaborate gotos is often quite unneccessary.
Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:
#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages
my ($file, $hspfile) = ("File1.txt", "file2.txt");
open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;
my @k_org = <$fh1>;
my @hsporg = <$fh2>;
# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
for (@$arr) {
chomp;
$_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
}
}
my $output_files = 28;
for my $z (1 .. $output_files) {
open my $out, ">", "$z.txt";
H_ORG:
for my $h_org (@hsporg) {
my $i = 0;
ORGINFO:
for my $orginfo (@k_org) {
# elements in array references are accessed like $arrayref->[$i]
if($h_org->[0] eq $orginfo->[0]) {
print $out "$z\n";
$i = 1;
last ORGINFO; # break out of this loop
} elsif($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
print $out "0";
$i = 1;
last ORGINFO;
}
}
print $out "0" if not $i;
}
}
# filehandles are closed automatically.
Now we can optimize further: In each line, you only ever use the first element. This means we don't have to store the rest:
...;
for (@$arr) {
chomp;
$_ = (split /\t/, $_, 2)[0]; # save just the first element
}
...;
ORGINFO:
for my $orginfo (@k_org) {
# elements in array references are accessed like $arrayref->[$i]
if($h_org eq $orginfo) {
...;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
...;
}
}
Also, acessing scalars is a bit faster than accessing array elements.
The third arg to split
limits the number of resulting fragments. Because we are only interested in the first field, we don't have to split the rest too.
Next on, we last
out of the ORGINFO
loop, then check a flag. This is unneccessary: We can jump directly to the next iteration of the H_ORG
loop instead of setting the flag. If we drop out naturally out of the ORGINFO
loop, the flag is guaranteed to not be set, so we can execute the print
anyway:
H_ORG:
for my $h_org (@hsporg) {
for my $orginfo (@k_org) {
if($h_org eq $orginfo) {
print $out "$z\n";
next H_ORG;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
print $out "0";
next H_ORG;
}
}
print $out "0";
}
Then, you compare the same data 28 times to print it to different files. Better: Define two subs print_index
and print_zero
. Inside these, you loop over the output filehandles:
# make this initialization *before* you use the subs!
my @filehandles = map {open my $fh, ">", "$_.txt"; $fh} 1 .. $output_files;
...; # the H_ORG loop
sub print_index {
for my $i (0 .. $#filehandles) {
print {$filehandles[$i]} $i+1, "\n";
}
}
sub print_zero {
print {$_} 0 for @filehandles;
}
Then:
# no enclosing $z loop!
H_ORG:
for my $h_org (@hsporg) {
for my $orginfo (@k_org) {
if($h_org eq $orginfo) {
print_index()
next H_ORG;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
print_zero();
next H_ORG;
}
}
print_zero();
}
This avoids checking data you already know doesn't match.
Upvotes: 3
Reputation: 33273
If you sort(1)
the files you can then check it in a single pass. Should not take more than a couple of seconds (including the sort).
Another way is to load all values from file1 into a hash. It's a bit more memory-consuming, especially if file1 is large, but should be fast (again, no more than a couple of seconds).
I would choose perl over C for such a job, and I'm more proficient in C than in perl. It's much faster to code in perl for this kind of job, less error-prone and runs fast enough.
Upvotes: 5
Reputation: 1139
In C you could try using "qsort" and "bsearch" functions
First you need to load your files into an array.
Than you should execute a qsort() (unless you are sure the elements have an order). And use the bsearch() to execute a binary search into your array.
http://linux.die.net/man/3/bsearch
This will be much more faster than check all elements one by one.
You could try to implement a binary search in perl if it do not exist already, it is a simple algorithm.
Upvotes: 1