chrys
chrys

Reputation: 85

Finding and merging down intervals in Perl

my input from a file looks like this: The file has tab as delimiter and is sorted alphabetically for samples and numerically for the features in column 2 and 3. What I want to do is find overlapping and included features and merge them down to one feature.

SampleA 100 500
SampleA 200 600
SampleA 300 400
SampleA 700 800
SampleA 900 1100
SampleA 1200    1500
SampleA 1400    1700
SampleA 1600    1900
SampleB 400 600 
SampleB 700 900 
SampleB 1000    1800    
SampleB 1500    1600
SampleB 1900    2500    
SampleB 2500    2600    
SampleB 3000    3600    
SampleB 3100    3400

For example: The first three SampleA cases would become:

Sample A 100 600

My problem at the moment is that I can find the incidences when I iterate over my data structure but I am somewhat stuck when trying to merge my samples. My idea was just to redo my loop until everything was found and merged but I am not sure on how to achieve this. At the moment the data is stored in a 2D array like this: @storage = [SampleA, start, stop]

my $j = 1;

for (my $i = 0; $i < scalar(@storage); $i++) {

    if ($storage[$i][0] eq $storage[$j][0]) {

        if ($storage[$i][2] > $storage[$j][1] && $storage[$i][2] < $storage[$j][2]) {

            print "Found Overlapp!\n";

        }elsif ( $storage[$i][2] > $storage[$j][1] && $storage[$i][2] > $storage[$j][2]) {

            print "Found Feature in Feature!\n";
        }
    }
    unless ($j == scalar(@storage)){$j++};
}

My goal would be to rerun this loop until no further match is found and thereby all intervals are non-overlapping but I am rather stuck here.

Upvotes: 0

Views: 240

Answers (2)

kmkaplan
kmkaplan

Reputation: 18960

As you input is nicely sorted, you can filter it efficiently using only fixed memory.

$_ = <> or exit;
my @sample = split;
while (<>) {
    my @newsample = split;
    if ($sample[0] ne $newsample[0]
        || $newsample[2] < $sample[1]
        || $sample[2] < $newsample[1]) {
        # Unmergeable sample
        print "$sample[0]\t$sample[1]\t$sample[2]\n";
        @sample = @newsample;
    }
    elsif ($sample[1] <= $newsample[1] && $newsample[2] <= $sample[2]) {
        # @newsample is included in @sample. Nothing to do
    }
    elsif ($sample[1] <= $newsample[1]) {
        # This @newsample raises the upper limit
        $sample[2] = $newsample[2];
    }
    elsif ($newsample[2] <= $sample[2]) {
        # This @newsample lowers the lower limit.
        $sample[1] = $newsample[1];
    }
    else {
        # This @newsample moves both limits
        @sample = @newsample;
    }
}
# Output the last sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";

Upvotes: 3

Sobrique
Sobrique

Reputation: 53478

I think I'd do it like this:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my %ranges;

#iterate line by line. 
while (<>) {
   chomp;
   #split by line
   my ( $name, $start_range, $end_range ) = split;
   #set a variable to see if it's within an existing range. 
   my $in_range = 0;
   #iterate all the existing ones. 
   foreach my $range ( @{ $ranges{$name} } ) {

      #merge if start or end is 'within' this range. 
      if (
         ( $start_range >= $range->{start} and $start_range <= $range->{end} )

         or

         ( $end_range >= $range->{start} and $end_range <= $range->{end} )
        )
      {


         ## then the start or end is within the existing range, so add to it:
         if ( $end_range > $range->{end} ) {
            $range->{end} = $end_range;
         }
         if ( $start_range < $range->{start} ) {
            $range->{start} = $start_range;
         }
         $in_range++;
      }

   }
   #didn't find any matches, so create a new range identity. 
   if ( not $in_range ) {
      push @{ $ranges{$name} }, { start => $start_range, end => $end_range };
   }
}

print Dumper \%ranges;

#iterate by sample
foreach my $sample ( sort keys %ranges ) {
   #iterate by range (sort by lowest start)
   foreach
     my $range ( sort { $a->{start} <=> $b->{start} } @{ $ranges{$sample} } )
   {
      print join "\t", $sample, $range->{start}, $range->{end}, "\n";
   }
}

Outputs with your data:

SampleA 100 600 
SampleA 700 800 
SampleA 900 1100    
SampleA 1200    1900    
SampleB 700 900 
SampleB 1000    1800    
SampleB 1900    2600    
SampleB 3000    3600    

This probably isn't the most efficient algorithm though, because it checks all the ranges - but you probably don't need to, because the input data is ordered - you can just check the 'most recent' instead.

Upvotes: 2

Related Questions