neversaint
neversaint

Reputation: 64004

Finding Base Overlap Counts and Internal Gap in Two Strings

I have this two strings of equal length, which I need to compare. I want to find overlap base(.) and internal gap (*). Below is the example:

------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
      ................**.................

Number of overlap = 33. Number of internal gap = 2.

I have no problem finding the number of overlap. But I have problem finding internal gap. Below is the current code I have. It is horribly slow. In principle I need to compute millions of such pairs.

#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

print "$s1\n";
print "$s2\n";


my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);

my $ovlp_basecount = 0;
my $internal_gap = 0;

foreach my $si ( 0 .. length($s1)  ) {


    my $base1 = substr($s1,$si,1);
    my $base2 = substr($s2,$si,1);


    # Overlap
    if ( $base{$base1} && $base{$base2} ) {
        $ovlp_basecount++;
    }

    # Not sure how to compute internal gap

}


print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";

Please advice how can I find internal gap and overlap efficiently.

Upvotes: 1

Views: 725

Answers (2)

Mike
Mike

Reputation: 21659

You can use a bitwise OR on the strings to find the the areas in one string that overlap blank areas in the other. This process also has the effect of revealing the overlap by converting non-overlapping characters to lower case, thus making finding the overlap quite simple too:

#!/usr/bin/perl

use strict;
use warnings;

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

$s1 =~ tr/-/\x20/;
$s2 =~ tr/-/\x20/;
my $or = $s1 | $s2;
(my $gap) = $or =~ m/^.*[ACTG]([actg]+)[ACTG].*$/;
(my $overlap = $or) =~ s/[^A-Z]//g;

print "s1:      '$s1'\n";
print "s2:      '$s2'\n";
print "OR:      '$or'\n";
printf "Gap:     '%s' (%d)\n", $gap,     length $gap;
printf "Overlap  '%s' (%d)\n", $overlap, length $overlap;

Prints:

s1:      '      ACTAAAAATACAAAAA  TTAGCCAGGCGTGGTGGCAC'
s2:      '     TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG   '
OR:      '     tACTAAAAATACAAAAAaaTTAGCCAGGWGTGGTGGcac'
Gap:     'aa' (2)
Overlap  'ACTAAAAATACAAAAATTAGCCAGGWGTGGTGG' (33)

For more information on string bitwise operations:

http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt

Upvotes: 3

moinudin
moinudin

Reputation: 138347

Assuming the gaps never overlap, you can solve this using regular expressions. Here's an answer for your s1.

echo '------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC' | perl -ne '$s = 0; foreach(/[GTAC](-+)[GTAC]/) { $s += length($1); } print "$s\n";'
2

Upvotes: 1

Related Questions