Reputation: 127
Currently i have a requirement for comparing 2 tables each table having around 700M + records.
We have come up with a idea to compare in the File based rather than taking it to the DB , based upon our DB performance. Also heard that Perl is much faster than any other coding method
We need to compare for a Mobile number the Usage plan in one File to the same mobile Number in another File, either if present and its usage is matching or not , we need to write in a new file only the Not matching records.
Example:
File 1
number, Usage type , Usage Plan , Usage Volume (KB) ........
12344 , CP , FB , 100 ........
12323 , UP , FB , 200 ........
12322 , CP , G+ , 300 ........
File 2
number, Usage type , Usage Plan , Usage Volume (KB) ........
12344 , CP , FB , 100 ........
12323 , UP , FB , 210 ........
So in the above case my unmatch File should contain
Unmatch File
12323 , UP , FB , 210 ........
12322 , CP , G+ , 300 ........
........
Means there are many columns after this, which we will not be using to compare. They are rather like more details about the plan.
Please share your suggestions and coding idea on this.
Our aim is to complete the comparison within 6 - 7 hours.. so that loading and other things can be completed within 2 days..
Thanks in Advance.. Sam
Upvotes: 1
Views: 86
Reputation: 11
Here is another way that does not read the entire files into memory, which may be problem due to memory constraint. For instance, 700M record x 30 byte/rec = 21GB files.
It does requires the file to be sorted by the number when exported from DB. Assuming the number are increasing.
open FILE1, "file1";
open FILE2, "file2";
open OUT, ">out.txt";
$line1 = <FILE1>;
$line2 = <FILE2>;
sub number_part {
($line) = @_;
return $1 if $line =~ /^(\d{1..9})/;
}
while (1) {
if ($line1 eq $line2) {
$line1 = <FILE1>;
$line2 = <FILE2>;
} elsif ( number_part(line1) == number_part(line2) ) {
print OUT $line1;
print OUT $line2;
$line1 = <FILE1>;
$line2 = <FILE2>;
} elsif ( number_part($line1) < number_part($line2) ) {
print OUT $line1;
$line1 = <FILE1>;
} elsif ( number_part(line1) > number_part(line2) ) {
print OUT $line2;
$line2 = <FILE2>;
}
# Use a dummy record if EOF is reached for either file.
# Done when EOF is reached for both files.
$line1 = "9999999999" unless $line1;
$line2 = "9999999999" unless $line2
last if $line eq "999999999" and $line2 eq "9999999999";
}
close(FILE1);
close(FILE2);
close(OUT);
Upvotes: 1
Reputation: 133
How about this:
use strict;
open FILE1, 'file1.txt';
open FILE2, 'file2.txt';
open OUTPUT, '>output.txt';
my $regex = qr/^ *(\d+) , (.*) , (.*) , (\d+)/;
my $file1;
while(<FILE1>){
if(/$regex/){
$file1->{$1}->{type} = $2;
$file1->{$1}->{plan} = $3;
$file1->{$1}->{volume} = $4;
}
}
my $file2;
while(<FILE2>){
if(/$regex/){
$file2->{$1}->{type} = $2;
$file2->{$1}->{plan} = $3;
$file2->{$1}->{volume} = $4;
}
}
my $numbers;
$numbers->{$_} = 1 foreach keys %$file1;
$numbers->{$_} = 1 foreach keys %$file2;
my $output;
foreach(keys %$numbers){
if(defined $file1->{$_} && defined $file2->{$_}){
if($file1->{$_}->{type} ne $file2->{$_}->{type} || $file1->{$_}->{plan} ne $file2->{$_}->{plan} || $file1->{$_}->{volume} ne $file2->{$_}->{volume}){
push @$output, [$_, $file2->{$_}->{type}, $file2->{$_}->{plan}, $file2->{$_}->{volume}];
}
}elsif(defined $file1->{$_}){
push @$output, [$_, $file1->{$_}->{type}, $file1->{$_}->{plan}, $file1->{$_}->{volume}];
}else{
push @$output, [$_, $file2->{$_}->{type}, $file2->{$_}->{plan}, $file2->{$_}->{volume}];
}
}
print OUTPUT join(' , ', @$_)."\n" foreach @$output;
Upvotes: 0