Reputation: 85
my input from a file looks like this: The file has tab as delimiter and is sorted alphabetically for samples and numerically for the features in column 2 and 3. What I want to do is find overlapping and included features and merge them down to one feature.
SampleA 100 500
SampleA 200 600
SampleA 300 400
SampleA 700 800
SampleA 900 1100
SampleA 1200 1500
SampleA 1400 1700
SampleA 1600 1900
SampleB 400 600
SampleB 700 900
SampleB 1000 1800
SampleB 1500 1600
SampleB 1900 2500
SampleB 2500 2600
SampleB 3000 3600
SampleB 3100 3400
For example: The first three SampleA cases would become:
Sample A 100 600
My problem at the moment is that I can find the incidences when I iterate over my data structure but I am somewhat stuck when trying to merge my samples.
My idea was just to redo my loop until everything was found and merged but I am not sure on how to achieve this.
At the moment the data is stored in a 2D array like this:
@storage = [SampleA, start, stop]
my $j = 1;
for (my $i = 0; $i < scalar(@storage); $i++) {
if ($storage[$i][0] eq $storage[$j][0]) {
if ($storage[$i][2] > $storage[$j][1] && $storage[$i][2] < $storage[$j][2]) {
print "Found Overlapp!\n";
}elsif ( $storage[$i][2] > $storage[$j][1] && $storage[$i][2] > $storage[$j][2]) {
print "Found Feature in Feature!\n";
}
}
unless ($j == scalar(@storage)){$j++};
}
My goal would be to rerun this loop until no further match is found and thereby all intervals are non-overlapping but I am rather stuck here.
Upvotes: 0
Views: 240
Reputation: 18960
As you input is nicely sorted, you can filter it efficiently using only fixed memory.
$_ = <> or exit;
my @sample = split;
while (<>) {
my @newsample = split;
if ($sample[0] ne $newsample[0]
|| $newsample[2] < $sample[1]
|| $sample[2] < $newsample[1]) {
# Unmergeable sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
@sample = @newsample;
}
elsif ($sample[1] <= $newsample[1] && $newsample[2] <= $sample[2]) {
# @newsample is included in @sample. Nothing to do
}
elsif ($sample[1] <= $newsample[1]) {
# This @newsample raises the upper limit
$sample[2] = $newsample[2];
}
elsif ($newsample[2] <= $sample[2]) {
# This @newsample lowers the lower limit.
$sample[1] = $newsample[1];
}
else {
# This @newsample moves both limits
@sample = @newsample;
}
}
# Output the last sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
Upvotes: 3
Reputation: 53478
I think I'd do it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %ranges;
#iterate line by line.
while (<>) {
chomp;
#split by line
my ( $name, $start_range, $end_range ) = split;
#set a variable to see if it's within an existing range.
my $in_range = 0;
#iterate all the existing ones.
foreach my $range ( @{ $ranges{$name} } ) {
#merge if start or end is 'within' this range.
if (
( $start_range >= $range->{start} and $start_range <= $range->{end} )
or
( $end_range >= $range->{start} and $end_range <= $range->{end} )
)
{
## then the start or end is within the existing range, so add to it:
if ( $end_range > $range->{end} ) {
$range->{end} = $end_range;
}
if ( $start_range < $range->{start} ) {
$range->{start} = $start_range;
}
$in_range++;
}
}
#didn't find any matches, so create a new range identity.
if ( not $in_range ) {
push @{ $ranges{$name} }, { start => $start_range, end => $end_range };
}
}
print Dumper \%ranges;
#iterate by sample
foreach my $sample ( sort keys %ranges ) {
#iterate by range (sort by lowest start)
foreach
my $range ( sort { $a->{start} <=> $b->{start} } @{ $ranges{$sample} } )
{
print join "\t", $sample, $range->{start}, $range->{end}, "\n";
}
}
Outputs with your data:
SampleA 100 600
SampleA 700 800
SampleA 900 1100
SampleA 1200 1900
SampleB 700 900
SampleB 1000 1800
SampleB 1900 2600
SampleB 3000 3600
This probably isn't the most efficient algorithm though, because it checks all the ranges - but you probably don't need to, because the input data is ordered - you can just check the 'most recent' instead.
Upvotes: 2