Reputation: 327
I have 2 big files (tab delimited).
first file ->
Col1    Col2  Col3  Col4  Col5  Col6  Col7  Col8
101_#2  1     2     F0    263   278   2     1.5
102_#1  1     6     F1    766   781   1     1.0
103_#1  2     15    V1    526   581   1     0.0
103_#1  2     9     V2    124   134   1     1.3
104_#1  1     12    V3    137   172   1     1.0
105_#1  1     17    F2    766   771   1     1.0
second file ->
Col1    Col2  Col3  Col4
97486   9     262   279
67486   9     118   119
87486   9     183   185
248233  9     124   134
I want to compare Col5 and Col6 of file 1 (treated as a range) with Col3 and Col4 of file 2. If a range from file 1 falls within a range in file 2, I want to print that row (from file 1).
Expected output ->
Col1    Col2  Col3  Col4  Col5  Col6  Col7  Col8
101_#2  1     2     F0    263   278   2     1.5
103_#1  2     9     V2    124   134   1     1.3
So far I have tried ->
@ARGV or die "No input file specified";
open my $first, '<',$ARGV[0] or die "Unable to open input file: $!";
open my $second,'<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    while (<$second>) {
        @sec = split /\s+/;
        print join("\t", @cols), "\n" if ($p1 >= $sec[2] && $p2 <= $sec[3]);
    }
}
But this works only for the first row. Also, the files are very big (around 6 GB).
I also tried something with hashes:
@ARGV or die "No input file specified";
open my $first, '<',$ARGV[0] or die "Unable to open input file: $!";
open my $second,'<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while (<$second>) {
    chomp;
    @line = split /\s+/;
    $hash{$line[2]} = $line[3];
}
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    foreach $key (sort keys %hash) {
        if ($p1 >= $key) {
            if ($p2 <= $hash{$key}) {
                print join("\t", @cols), "\n";
            }
        }
        else { next; }
    }
}
But this is also taking a lot of time and memory. Can anybody suggest how I can make it fast using hashes? Thanks a lot.
Upvotes: 2
Views: 2200
Reputation: 31
Another solution that I found greatly speeds things up is using a subroutine. Suppose you are comparing the first and the second columns of both files, which was my intention. First, sort both files by the first and then the second column. You then read the first file's ranges, and for each one call a subroutine that does the matching in the second file and writes the matching line to an output file when a match is found. In the subroutine you also save the line number where the last match was found, so that Perl can skip straight past that point on the next call instead of rescanning from the top. Note that I start at the first line of the second file.
use warnings;
use strict;

open my $first,  '<', "first_file.txt" or die $!;
open my $second, '<', "second_file"    or die $!;
open my $output, '>', "output.txt"     or die $!;

my $line_number = 1;

foreach (<$first>) {
    chomp;
    my @cols = split /\s+/;
    my $p1   = $cols[0];
    my $p2   = $cols[1];
    $line_number = match( $p1, $p2, $line_number );
}

# Scan the second file (still positioned just past the last match) for a line
# whose first two columns equal $p1 and $p2; on a match, write the line out
# and return its line number so the next call can skip straight past it.
sub match {
    my ( $p1, $p2, $line_number ) = @_;
    while (<$second>) {
        next if $. < $line_number;
        chomp;
        my @sec = split /\s+/;
        if ( $p1 == $sec[0] && $p2 == $sec[1] ) {
            print {$output} $_, "\n";
            return $.;
        }
    }
    return $line_number;
}
Upvotes: 0
Reputation: 41
Take a look at http://search.cpan.org/dist/Data-Range-Compare-Stream/lib/Data/Range/Compare/Stream.pod
Here is an example based on your source files. The amazing thing is that the Perl script never grows beyond a few MB of memory, no matter how big the source files are! Just make sure you have Data::Range::Compare::Stream version 3.023 or higher!
Notes:
This script sorts your input files using an on-disk merge sort, which can take a long time with really big files. You can tune the performance by adjusting the bucket_size argument to the Data::Range::Compare::Stream::Iterator::File::MergeSortAsc constructor (a small illustration follows the example output below). See http://search.cpan.org/dist/Data-Range-Compare-Stream/lib/Data/Range/Compare/Stream/Iterator/File/MergeSortAsc.pod#OO_Methods for more details.
use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::File::MergeSortAsc;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;
use Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn;
my $cmp=new Data::Range::Compare::Stream::Iterator::Compare::Asc;
sub parse_file_one {
my ($line)=@_;
my @list=split /\s+/,$line;
return [@list[4,5],$line]
}
sub parse_file_two {
my ($line)=@_;
my @list=split /\s+/,$line;
return [@list[2,3],$line]
}
sub range_to_line {
my ($range)=@_;
return $range->data;
}
my $file_one=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
result_to_line=>\&range_to_line,
parse_line=>\&parse_file_one,
filename=>'custom_file_1.src',
);
my $file_two=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
result_to_line=>\&range_to_line,
parse_line=>\&parse_file_two,
filename=>'custom_file_2.src',
);
my $set_one=new Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn(
$file_one,
$cmp
);
my $set_two=new Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn(
$file_two,
$cmp
);
$cmp->add_consolidator($set_one);
$cmp->add_consolidator($set_two);
while ($cmp->has_next) {
    my $result = $cmp->get_next;
    next if $result->is_empty;
    my $ref = $result->get_root_results;
    next if $#{$ref->[0]} == -1;
    next if $#{$ref->[1]} == -1;
    foreach my $overlap (@{$ref->[0]}) {
        print $overlap->get_common->data;
    }
}
The only quirk is the output would be in a different order:
103_#1 2 9 V2 124 134 1 1.3
101_#2 1 2 F0 263 278 2 1.5
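As mentioned in the notes above, the on-disk merge sort can be tuned through the bucket_size constructor argument. A small, hedged illustration of what the constructor call in the script above might look like with it (the value 10_000 is only a placeholder, not a recommendation):
# Same constructor as in the script above, with an explicit bucket_size;
# see the linked MergeSortAsc documentation for how to choose the value.
my $file_one = new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
    result_to_line => \&range_to_line,
    parse_line     => \&parse_file_one,
    filename       => 'custom_file_1.src',
    bucket_size    => 10_000,
);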
Upvotes: 1
Reputation: 107040
You realize that with your double loop you're creating an algorithm that's O(N²) in efficiency. For example, if both files contain 100 lines each, you'll run your inner loop 10,000 times. If both files contain 1,000 lines, it won't take 10 times longer, but 100 times longer. If these files are as big as you claim, you're going to be waiting a long, long time for your program to complete.
Your best bet is to put your data into a SQL database (something that's made for dealing with large data sources).
Otherwise, you'll have to store your first file in a format where you can quickly search for the right range -- such as a binary tree.
Store your first file as a binary tree keyed on the low end of the range, keeping both the low and high values in each node for the comparison.
For each line in the second file, you'd search through your binary tree for the correct lower range, compare the higher range, and if it's a match, you've found your node.
This is way too complex for me to write out a quick algorithm. However, there are several binary tree modules on CPAN which should make storing and searching your tree much easier. Unfortunately, I've never used one, so I can't make a recommendation, but you should probably pick a balanced-tree implementation such as Tree::AVL.
Using such a structure is certainly more complex than your double loop, but it's much, much faster: the work grows roughly with the combined size of the two files (plus a logarithmic factor for the tree lookups), rather than with their product.
Another possibility is to sort the two files into two separate arrays. Perl's sorting algorithm is somewhere around O(N log N), which is much more efficient than a double loop, but not as efficient as building a binary tree. However, if the two files are more or less already in order, it'll be closer to the binary tree in efficiency and a lot faster to implement.
If you sort both arrays, you can go down sequentially in file #2, and find the row in file #1. Since both files are in order, you don't have to start at the beginning of file #1 when searching for the next matching row in file #2.
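A rough sketch of that sorted-arrays idea, assuming hypothetical file names f1.txt and f2.txt laid out as in the question, the containment test the question asks for, and that both files fit in memory for the sorting step (matching rows come out ordered by Col5 rather than in their original order):
use strict;
use warnings;

# Sort the second file's ranges by their start column (Col3).
open my $f2, '<', 'f2.txt' or die $!;
<$f2>;                                        # skip header
my @ranges = sort { $a->[0] <=> $b->[0] }
             map  { [ (split)[2, 3] ] } <$f2>;

# Sort the first file's rows by their range start (Col5), keeping each line.
open my $f1, '<', 'f1.txt' or die $!;
my $header = <$f1>;
my @rows = sort { $a->[0] <=> $b->[0] }
           map  { [ (split)[4, 5], $_ ] } <$f1>;

print $header;
my ($i, $max_end) = (0, -1);
for my $row (@rows) {
    my ($beg, $end, $line) = @$row;
    # Sweep in every file-2 range that starts at or before this row's start,
    # remembering the furthest end seen so far.
    while ($i < @ranges && $ranges[$i][0] <= $beg) {
        $max_end = $ranges[$i][1] if $ranges[$i][1] > $max_end;
        $i++;
    }
    # If some already-seen range reaches at least as far as this row's end,
    # that range contains the row's range, so print the row.
    print $line if $max_end >= $end;
}
Because a running maximum of the range ends is carried along, each sorted array is walked only once, so after the sorts the matching pass is linear in the combined size of the two files.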
Hope this helps.
Upvotes: 0
Reputation: 126722
You're trying to read through the second file again when it's already at end-of-file. To get this to work you need to write seek $second, 0, 0 just before the inner while loop.
However this method will be extremely slow, and it would improve things vastly if you were to read all the ranges from the second file into memory first. This code does that. I suggest you try it to see if it will work within your available memory.
use strict;
use warnings;

use List::Util 'first';

my @ranges;

open my $fh, '<', 'f2.txt' or die $!;
while (<$fh>) {
    my ($beg, $end) = (split)[2,3];
    next if $beg =~ /\D/ or $end =~ /\D/;
    push @ranges, [$beg, $end];
}

open $fh, '<', 'f1.txt' or die $!;
while (<$fh>) {
    my ($beg, $end) = (split)[4,5];
    next if $beg =~ /\D/ or $end =~ /\D/;
    print if first { $beg >= $_->[0] and $end <= $_->[1] } @ranges;
}
Upvotes: 1
Reputation: 1146
This seems to work just fine (and it is fairly close to your original code)
@ARGV or die "No input file specified";
open my $first, '<', $ARGV[0] or die "Unable to open input file: $!";
open my $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar(<$first>);
my $secondHeader = <$second>;
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    my $secondLine = <$second>;
    if ( defined $secondLine ) {
        @sec = split( /\s+/, $secondLine );
        print join( "\t", @cols ), "\n" if ( $p1 >= $sec[2] && $p2 <= $sec[3] );
    }
}
Upvotes: 0
Reputation: 7516
You're reading the whole second file as soon as you read the first file's second record. Change:
while(<$second>) {
to something like:
if (defined($_ = <$second>)) {
So you have:
#!/usr/bin/env perl
use strict;
use warnings;

my ( @cols, $p1, $p2, @sec );

@ARGV or die "No input file specified";
open my $first,  '<', $ARGV[0] or die "Unable to open input file: $!";
open my $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar <$first>;
<$second>;    #...throw away first line...
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    if (defined($_ = <$second>)) {
        @sec = split /\s+/;
        print join("\t", @cols), "\n" if ($p1 >= $sec[2] && $p2 <= $sec[3]);
    }
}
Upvotes: 0
Reputation: 753525
This is basic 'query optimization' in the sense that an SQL optimizer does it. You have a variety of options.
One option is to read File1 a line at a time, and to read through File2 for each line of File1, printing the matching data. Clearly, this is slow, though it is not the slowest way: that would be to read each line of File2 in turn and scan File1 (the larger file) for matches. This technique works regardless of the order of the content in the files.
Another option, which does not rely on the data being ordered either, is to read the smaller file into memory, and then read the larger file a line at a time, pulling the matching data. In the simplest form, you use a linear search of the in-memory data; it would be better to organize it so that it is possible to stop the search through the in-memory data more quickly (perhaps sorted by Col3 values, secondarily on Col4 values).
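For illustration, a minimal sketch of that in-memory variant, assuming file2 is the smaller file, hypothetical file names file1.txt and file2.txt, and the containment test from the question; sorting the stored ranges by Col3 lets the search stop as soon as a range starts beyond Col5:
use strict;
use warnings;

# Keep the (assumed smaller) second file's ranges in memory, sorted by their
# start value (Col3), so the per-line search can stop early.
open my $f2, '<', 'file2.txt' or die $!;
<$f2>;                                   # skip the header line
my @ranges = sort { $a->[0] <=> $b->[0] }
             map  { [ (split)[2, 3] ] } <$f2>;

# Stream the first (larger) file a line at a time.
open my $f1, '<', 'file1.txt' or die $!;
print scalar <$f1>;                      # echo file 1's header
while (my $line = <$f1>) {
    my ($p1, $p2) = (split ' ', $line)[4, 5];
    for my $r (@ranges) {
        last if $r->[0] > $p1;           # all later ranges start after Col5
        if ($p2 <= $r->[1]) {            # contained: Col5 >= Col3, Col6 <= Col4
            print $line;
            last;
        }
    }
}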
If the data on disk is already appropriately sorted, then you may be able to do without one of the files in memory, and simply do a merge-like operation on the files. You'd probably want File1 sorted in order of Col5 (secondarily on Col6), while File2 would be sorted in order of Col3 and Col4. This reduces the amount of data in memory, at the cost of pre-sorting the data. You'd need to think this through carefully: you're aiming to avoid reading too much data into memory, but because the matching condition is on ranges, you are likely to need to keep some number of lines from at least one of the files in memory for reuse.
If you have memory enough and the data is not pre-sorted, you might decide to read both files into memory, sort appropriately, and then do the merge operation.
Since you are working with ranges, you might, in theory, get involved with an R-Tree indexing mechanism instead. However, that would probably be overkill for a couple of text files, unless you're going to be doing this often.
Finally, since I identified this as the sort of thing the SQL optimizers do all the time, you might do best to load an actual database with the data and then run the query:
SELECT F1.*, F2.*
FROM File1 AS F1 JOIN File2 AS F2
ON F1.Col5 <= F2.Col4 AND F1.Col6 >= F2.Col3
The condition tests that F1.Col5 .. F1.Col6 overlaps with F2.Col3 .. F2.Col4. It assumes that if you have [129..145] and [145..163], then you need the match. If that's incorrect, adjust the <= and >= appropriately. See How do I compare overlapping values in a row and, more particularly, Determine whether two date ranges overlap. Although both are talking about dates and times, the answers apply to numerical ranges just as well (or to any other range).
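If you take the database route, a rough sketch using DBI with DBD::SQLite might look like the following (the database file name, table definitions, and input file names are all assumptions; the join mirrors the overlap test above, and only the File1 columns are returned since that is what the question asks for):
use strict;
use warnings;
use DBI;

# Assumptions: DBD::SQLite is installed; file1.txt and file2.txt hold the
# data shown in the question; table and column names are placeholders.
my $dbh = DBI->connect('dbi:SQLite:dbname=ranges.db', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS File1
          (Col1 TEXT, Col2 INT, Col3 INT, Col4 TEXT,
           Col5 INT,  Col6 INT, Col7 INT, Col8 REAL)');
$dbh->do('CREATE TABLE IF NOT EXISTS File2
          (Col1 INT, Col2 INT, Col3 INT, Col4 INT)');

$dbh->begin_work;    # bulk-load both files inside one transaction
my $ins1 = $dbh->prepare('INSERT INTO File1 VALUES (?,?,?,?,?,?,?,?)');
open my $fh1, '<', 'file1.txt' or die $!;
<$fh1>;              # skip header
while (my $line = <$fh1>) {
    $ins1->execute(split ' ', $line);
}

my $ins2 = $dbh->prepare('INSERT INTO File2 VALUES (?,?,?,?)');
open my $fh2, '<', 'file2.txt' or die $!;
<$fh2>;
while (my $line = <$fh2>) {
    $ins2->execute(split ' ', $line);
}
$dbh->commit;

$dbh->do('CREATE INDEX IF NOT EXISTS f2_range ON File2 (Col3, Col4)');

# Only the File1 columns, deduplicated, since that is what the question wants.
my $sth = $dbh->prepare(q{
    SELECT DISTINCT F1.*
    FROM File1 AS F1 JOIN File2 AS F2
      ON F1.Col5 <= F2.Col4 AND F1.Col6 >= F2.Col3
});
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join("\t", @row), "\n";
}
$dbh->disconnect;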
Of the options outlined, the simplest one with a reasonable performance characteristic is the second: read the smaller file into memory and then process the larger file a line at a time against it.
However, if there are memory constraints, or time constraints, that prevent this working, then you'll need to choose one of the other mechanisms.
Upvotes: 0